Methodology

How we build trustworthy financial data

111M+ SEC facts across 19,000+ entities: point-in-time accurate, survivorship-free, and each auditable back to its source filing. Here is exactly how each guarantee is built.

Bloomberg, WRDS, and Compustat make claims about point-in-time accuracy and survivorship-bias-free coverage. We document exactly how ours work — from coverage and survivorship to concept standardization, amendment handling, accepted_at semantics, the smart-money dataset, and the validation checks that run on every release. Then we show you how to audit any of it yourself.


Why XBRL is hard

The SEC has required XBRL submissions since 2009. The format is machine-readable, but standardization stops there. Each filer picks their own taxonomy: us-gaap, ifrs-full, or a custom extension. A single concept like “revenue” resolves to a dozen possible XBRL tags depending on the company, the year, and whether ASC 606 had been adopted.

Restatements are not corrections — they are new filings with the same fiscal period but different values. Quarterly cash flow statements report year-to-date totals, not quarter-only figures. Foreign filers use 20-F and 40-F instead of 10-K with subtly different concept names. And the companies that went bankrupt stopped filing — so a naïve dataset quietly forgets they ever existed.

Anyone can parse XBRL. Producing a dataset where SELECT revenue FROM fact WHERE ticker = 'AAPL' returns the same values 30 years apart, and where a backtest sees the world exactly as it looked on its as-of date — that's the work.

Coverage & scope

The dataset spans the full SEC EDGAR universe of XBRL filers from 1993–present: every active company and every company that has since delisted, gone bankrupt, or been acquired. It is organized as two datasets — a 111M+-fact fundamentals core and a 78M+-row smart-money dataset — across 20 Parquet tables (14 core + 6 smart-money, schema v2.21.0). The core includes daily OHLCV price history (stock_price_daily, with an adjusted close and the raw corporate-action factors) so point-in-time valuation multiples and price returns compute without a second vendor.

111M+	Financial facts
19,000+	Entities (active + delisted)
1993–present	History
78M+	Smart-money rows
292	Standardized concepts
20	Parquet tables

Filing forms covered

Form	Description
10-K	Annual report
10-Q	Quarterly report
8-K	Material events
20-F	Foreign annual report
40-F	Canadian annual report
/A	Amendments to any of the above

The fundamentals core is sourced from the SEC's quarterly EDGAR Financial Statements Data Sets plus the per-filing XBRL submissions — the same primary source the Commission publishes. Nothing is scraped from third-party aggregators. See the full dataset page for the pipeline and per-table schema.

Survivorship-free by construction

Survivorship bias is the single most expensive hidden flaw in a backtest. Most datasets only keep the companies that are still trading today — so your strategy is silently tested against a universe that already knows which companies survived. The Enrons, the Lehmans, the RadioShacks vanish, and historical returns inflate by a percentage point or two that evaporates the moment you trade live.

Valuein retains every entity that ever filed XBRL financial statements — delisted, bankrupt, acquired, merged — with its complete filing history through its final SEC filing. Roughly half of the 19,000+ entities in the universe are no longer actively trading. They stay in the dataset; they are simply not present on dates after they stopped filing, which is exactly how they would have appeared in real time.

A few of the failures still in the data

Enron Corp

Lehman Brothers

RadioShack

Toys R Us

Blockbuster

WorldCom

Bear Stearns

Washington Mutual

Sears Holdings

Each with complete financial statements through its final filing — so a strategy that would have bought them is held accountable for what happened next.

Point-in-time and survivorship-free are two of the trust guarantees we make. See the trust & security overview for provenance, zero-retention, and reliability — this page covers the data construction underneath them.

Concept standardization

We map 11,966 raw XBRL tags to 292 canonical concepts. Definitions are versioned in a taxonomy_guide table that ships with every Parquet bucket — so you can audit every transformation we apply, and unmapped tags fall through to a labelled Other rather than being silently dropped.

Worked example: Revenue

Source XBRL tag	Used by	Note
us-gaap:Revenues	Apple, Microsoft	Most common
us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax	Tesla, Walmart	Post-ASC 606 adoption
us-gaap:SalesRevenueNet	Pre-2018 filers	Legacy tag, deprecated
us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax	Some retailers	Includes sales tax pass-through
msft:Revenues	Microsoft (custom extension)	Custom XBRL extension

All five resolve to standard_concept = 'Revenues'. Every fact also keeps its raw concept column — the exact XBRL tag the company filed — so you can always trace a standardized value back to source. This 5-row illustration is the level of detail we publish; the canonical name and definition of every concept lives in the data catalog.
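As a minimal sketch of the lookup described above (not the production mapper), the dictionary below holds only the five tags from the worked example. The helper name `standardize` and the tag `acme:WidgetBacklog` are hypothetical, introduced purely for illustration; the real mapping lives in the taxonomy_guide table.

```python
# Illustrative only: the five tags from the worked example above.
# This dict is a stand-in for the taxonomy_guide table.
TAG_TO_CONCEPT = {
    "us-gaap:Revenues": "Revenues",
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax": "Revenues",
    "us-gaap:SalesRevenueNet": "Revenues",
    "us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax": "Revenues",
    "msft:Revenues": "Revenues",
}


def standardize(raw_tag: str) -> dict:
    """Return the raw tag together with its canonical concept.

    Unmapped tags are labelled "Other" instead of being dropped.
    """
    return {
        "concept": raw_tag,  # exact tag as filed, kept for traceability
        "standard_concept": TAG_TO_CONCEPT.get(raw_tag, "Other"),
    }


mapped = standardize("us-gaap:SalesRevenueNet")
unmapped = standardize("acme:WidgetBacklog")  # hypothetical custom tag
print(mapped["standard_concept"], unmapped["standard_concept"])
```

Keeping the raw tag next to the standardized one is what makes the mapping auditable: any row can be checked against the tag the company actually filed.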

Point-in-time, not point-in-hindsight

The most common look-ahead bias in financial data isn't malicious — it's using the wrong date column. Three timestamps live on every fact, and they mean different things.

report_date: When the period ended

e.g. 2024-09-28 (Apple FY2024)

Aligns financials to a fiscal calendar. Never use as a PIT cutoff — companies file weeks or months later.

filing_date: When the filing was submitted

e.g. 2024-11-01

Useful for filing-cadence analysis. Still not PIT-safe — filings can be accepted hours after the date stamp.

accepted_at: When SEC accepted it (the canonical PIT field)

e.g. 2024-11-01T06:01:36Z

The exact moment the data became public. Use this — and only this — for backtests and any look-ahead-free analysis.

Every PIT-safe MCP tool and SDK method accepts an as_of_date parameter. Internally, that filters on accepted_at <= as_of_date — the queryable equivalent of “what did the market know on this date?”

The discipline extends to derived data. The ratio table is append-on-restatement: each ratio row carries its own accepted_at vintage (the latest acceptance timestamp of its input facts), and a restatement appends a new vintage instead of overwriting the old one. Filter accepted_at <= as_of_date and take the latest vintage per period — a backtest sees the ratio as it was knowable, never as it was later revised.
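The append-on-restatement idea can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the helper names `ts` and `ratio_vintage`, the dictionary layout, and all figures are made up. Each ratio row is stamped with the latest accepted_at among its inputs, and a restated input produces an additional row rather than replacing the old one.

```python
from datetime import datetime, timezone


def ts(text: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as 2024-11-01T06:01:36Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def ratio_vintage(numerator: dict, denominator: dict) -> dict:
    """Build one ratio row whose vintage is the latest input acceptance time."""
    return {
        "value": numerator["value"] / denominator["value"],
        "accepted_at": max(numerator["accepted_at"], denominator["accepted_at"]),
    }


# Made-up inputs for one fiscal period.
net_income = {"value": 20.0, "accepted_at": ts("2024-11-01T06:01:36Z")}
revenue = {"value": 100.0, "accepted_at": ts("2024-11-01T06:01:36Z")}
vintages = [ratio_vintage(net_income, revenue)]

# A later restatement of net income appends a new vintage; nothing is overwritten.
restated_income = {"value": 18.0, "accepted_at": ts("2025-02-10T21:15:00Z")}
vintages.append(ratio_vintage(restated_income, revenue))

for row in vintages:
    print(row["accepted_at"].isoformat(), row["value"])
```

Because the older vintage is still present, a query with an earlier as_of_date keeps returning the ratio that was knowable at that time.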

Amendments and restatements

When a company files a 10-K/A, the SEC treats it as a new filing — not an overwrite. Most data vendors collapse the amendment over the original, destroying the historical view. We keep both.

Two rows, one fiscal period

-- Apple FY2018 net income, original filing
ticker     fiscal_year  standard_concept  numeric_value  accepted_at
AAPL       2018         NetIncome         59531000000    2018-11-05T18:23:00Z

-- Apple FY2018 net income, after restatement (hypothetical)
ticker     fiscal_year  standard_concept  numeric_value  accepted_at
AAPL       2018         NetIncome         59300000000    2019-02-12T14:51:00Z

A backtest that ran on 2018-12-01 sees the first row only — the original $59.531B. A current dashboard sees the latest accepted value. Both are correct; both are queryable. The PIT discipline is what guarantees you get the right one.
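A minimal sketch of that selection, using the two rows above (the second being hypothetical, as noted). The function name `value_as_of` is an assumption for illustration and is not the SDK's API; it only demonstrates the accepted_at <= as_of_date rule followed by "take the latest".

```python
from datetime import datetime, timezone

# The two rows from the example above (the second is hypothetical).
ROWS = [
    {"ticker": "AAPL", "fiscal_year": 2018, "standard_concept": "NetIncome",
     "numeric_value": 59_531_000_000, "accepted_at": "2018-11-05T18:23:00Z"},
    {"ticker": "AAPL", "fiscal_year": 2018, "standard_concept": "NetIncome",
     "numeric_value": 59_300_000_000, "accepted_at": "2019-02-12T14:51:00Z"},
]


def parse_utc(text: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp ending in Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def value_as_of(rows, as_of: datetime):
    """Return the latest row accepted at or before as_of, else None."""
    visible = [r for r in rows if parse_utc(r["accepted_at"]) <= as_of]
    if not visible:
        return None
    return max(visible, key=lambda r: parse_utc(r["accepted_at"]))


backtest = value_as_of(ROWS, datetime(2018, 12, 1, tzinfo=timezone.utc))
dashboard = value_as_of(ROWS, datetime(2025, 1, 1, tzinfo=timezone.utc))
print(backtest["numeric_value"], dashboard["numeric_value"])
```

The sketch treats a date-only cutoff as midnight UTC, which is an assumption: under that reading, a filing accepted later the same day is not yet visible.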

Provenance: every number has an identity

A figure is only trustworthy if you can prove where it came from. Every fact in the dataset carries a deterministic fact_id — a SHA-256 hash of the entity, the SEC accession, the concept, the period end, and the unit:

How a fact_id is derived (Python)

from hashlib import sha256

# Deterministic in the pipeline, recomputable in the SDK.
# The key is encoded to bytes before hashing; UTF-8 is assumed here.
fact_id = sha256(
    f"{entity_id}|{accession_id}|{concept}|{period_end}|{unit}".encode("utf-8")
).hexdigest()

# The SDK exposes the same function: same inputs, same id, anywhere.
from valuein_sdk import compute_fact_id
compute_fact_id(entity_id, accession_id, concept, period_end, unit)

Because the hash is deterministic, the fact_id is identical in the Parquet files, the Python SDK, and the MCP server — there is no separate provenance database to drift out of sync; the identity travels with the value. Pass any fact_id to verify_fact_lineage and it resolves back to its source XBRL tag, accession ID, and filing URL for one-click verification against EDGAR.

The envelope on every fact

fact_id

A deterministic SHA-256 hash of entity_id, accession_id, concept, period_end, and unit. The same fact computes the same id everywhere — Parquet, SDK, and MCP agree byte-for-byte, so a figure can be re-derived and re-located, never mistaken for another.

confidence_score

How directly the value came from the filing — a clean mapped tag scores higher than one recovered through a fallback rule. Lets you threshold on certainty rather than trusting every row equally.

reliability_code

A 1–4 grade of the standardization path, from a primary canonical mapping down to a best-effort fallback. A single integer you can filter or weight on.

restatement_count

How many times this fiscal period has been re-filed. A non-zero count is an immediate signal that the as-reported and current values diverge — and that you should pick the right one for your as-of date.

This envelope is the foundation for everything below: the model never has to be trusted with a number, because the number arrives already attributable, gradeable, and verifiable.
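One way a consumer might use the envelope is sketched below. It assumes that reliability_code 1 is the primary canonical mapping and 4 the best-effort fallback; the helper names, the threshold of 2, and the sample rows (including the placeholder fact_id values) are all hypothetical.

```python
def usable(fact: dict, max_reliability_code: int = 2) -> bool:
    """Keep facts whose standardization path is at or above the chosen grade.

    Assumes 1 is the primary canonical mapping and 4 the best-effort fallback.
    """
    return fact["reliability_code"] <= max_reliability_code


def needs_vintage_choice(fact: dict) -> bool:
    """A non-zero restatement_count means as-reported and current values may differ."""
    return fact["restatement_count"] > 0


# Made-up rows for illustration; the fact_id values are placeholders.
facts = [
    {"fact_id": "a1", "reliability_code": 1, "restatement_count": 0},
    {"fact_id": "b2", "reliability_code": 4, "restatement_count": 0},
    {"fact_id": "c3", "reliability_code": 2, "restatement_count": 1},
]

kept = [f["fact_id"] for f in facts if usable(f)]
restated = [f["fact_id"] for f in facts if needs_vintage_choice(f)]
print(kept, restated)
```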

Quarterly cash flow derivation

Cash flow statements in 10-Q filings report year-to-date totals, so the Q2 and Q3 figures are cumulative, and the fourth quarter is only implied by the full-year 10-K. Computing a clean quarterly time series requires subtracting the prior quarter's year-to-date total: every time, for every issuer, for every line item.

Example: operating cash flow

Period        numeric_value (YTD)   derived_quarterly_value
Q1 2024       12.0B                 12.0B
Q2 2024       28.0B                 16.0B   ← 28.0 − 12.0
Q3 2024       45.0B                 17.0B   ← 45.0 − 28.0
Q4 2024       62.0B                 17.0B   ← 62.0 − 45.0

Both columns ship in every Parquet bucket. Use COALESCE(derived_quarterly_value, numeric_value) when you want a true quarterly time series; use numeric_value when you specifically want the as-reported YTD figure.
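The subtraction in the table can be written as a short function. This is a sketch, assuming the inputs are the year-to-date totals for one fiscal year in Q1 to Q4 order with no missing quarters; `derive_quarterly` is a hypothetical name, not an SDK function.

```python
def derive_quarterly(ytd_values):
    """Convert year-to-date totals (Q1..Q4 order) into quarter-only values.

    Q1 is already a single quarter; each later quarter is its YTD total
    minus the previous quarter's YTD total.
    """
    quarterly = []
    previous_ytd = 0.0
    for ytd in ytd_values:
        quarterly.append(ytd - previous_ytd)
        previous_ytd = ytd
    return quarterly


# Operating cash flow from the table above, in billions.
ytd = [12.0, 28.0, 45.0, 62.0]
quarters = derive_quarterly(ytd)
print(quarters)
```

A real implementation would also need to handle missing quarters and fiscal-year boundaries, which this sketch deliberately leaves out.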

Point-in-time index membership

“What was in the S&P 500 on March 1, 2014?” is a survivorship trap in disguise: screen the index by its current members and you have quietly excluded everyone who was dropped. Index membership is therefore tracked the same way facts are — historically, with effective and removal dates.

The index_membership table records membership spells for S&P 500 and Russell 1000 / 2000 / 3000 with an effective_date and a removal_date per spell, using half-open [effective, removal) interval semantics. A company that left and rejoined gets two spells, not a merged one — so a point-in-time universe on any date reconstructs the index exactly as it stood.


Members of an index on a given date (SQL)

SELECT r.symbol, r.name
FROM index_membership im
JOIN "references" r ON r.cik = im.cik  -- quoted: REFERENCES is a reserved SQL keyword
WHERE im.index_name = 'SP500'
  AND '2014-03-01' >= im.effective_date
  AND ('2014-03-01' < im.removal_date OR im.removal_date IS NULL);

There is no is_sp500 flag — a single boolean can only describe one index at one moment, which is precisely the snapshot bias we avoid. Membership is always a JOIN on cik.
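The half-open [effective, removal) rule is easy to get wrong at the boundaries, so here is a small sketch of it. The spell dates are invented for illustration and do not describe any real company; `is_member` and `in_index` are hypothetical helper names.

```python
from datetime import date
from typing import Optional


def is_member(on: date, effective: date, removal: Optional[date]) -> bool:
    """Half-open [effective, removal): in on the effective date, out on the removal date."""
    return effective <= on and (removal is None or on < removal)


# Hypothetical spells for one company that left and later rejoined an index.
spells = [
    (date(2010, 1, 4), date(2013, 6, 3)),
    (date(2015, 9, 21), None),  # open spell: still a member
]


def in_index(on: date) -> bool:
    """True if any spell covers the given date."""
    return any(is_member(on, eff, rem) for eff, rem in spells)


print(in_index(date(2013, 6, 2)), in_index(date(2013, 6, 3)), in_index(date(2014, 3, 1)))
```

Keeping the two spells separate is what lets the gap between them show up as non-membership; a merged spell would wrongly report the company as a member on 2014-03-01.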

The smart-money dataset

The second dataset — 78M+ rows across six tables — standardizes who is buying and who is holding. It is built from the SEC's mandatory ownership disclosures and held to the same point-in-time and survivorship guarantees as the fundamentals.

Insider activity

Forms: Form 3, Form 4, Form 5, Form 144

Every officer, director, and 10%+ owner transaction — buys, sells, option exercises, and proposed sales — standardized one row per transaction with the transaction code, shares, price, and post-transaction holdings.

Tables: insider_transaction, insider_filing, insider_party

Beneficial ownership

Forms: SC 13D, SC 13G

5%+ activist and passive stakes, one row per reporting person, with the percent owned and the full voting / dispositive-power breakdown.

Tables: insider_ownership

Institutional 13F

Forms: 13F-HR, 13F-NT

Quarterly position disclosures for every institutional manager — shares, USD market value, put/call, and voting authority — one row per holding, resolvable to the issuer.

Tables: institutional_holding, institutional_filing

Reporting persons are resolved into a deduplicated directory and 13F holdings are linked back to the issuer they describe, so each row carries a soft reference to entity.cik. The references are soft, not hard, foreign keys — a foreign, pre-IPO, or delisted issuer that doesn't resolve is kept rather than dropped, so coverage is never silently lost. Each disclosure is point-in-time via its own accepted_at. Full table-by-table detail is on the smart-money dataset page.

Foreign private issuers

Foreign private issuers don't file 10-Ks. They file 20-F (and Canadian issuers file 40-F), often under IFRS rather than US GAAP, with their own concept names. Those filings flow through the same standardization pipeline: concepts map into the same canonical standard_concept vocabulary, and an is_foreign flag on the entity lets you include or isolate them. The result is that a US filer and a foreign issuer answer the same query the same way.

Accounting-identity validation

XBRL is machine-readable, not self-consistent: a mis-tagged line item or a transposed figure will parse perfectly and still be wrong. Before any Parquet build is published, the pipeline checks the standardized facts against a catalogue of 35 published accounting identities — the articulation a financial statement is required to obey, drawn from the accounting literature and FASB codification — with a tight numerical tolerance to absorb legitimate rounding.

The measured result

All 19,607 S&P 500 annual filings pass every one of 35 accounting identities — 0 failures. The result is deliberately falsifiable: the baseline is published in our public repo (docs/accuracy/baseline.json), re-derivable from one DuckDB file, and a CI gate blocks the public headline from drifting more than a point from the measurement. A vendor that claims “99%+ accuracy” with no published test is asking for faith; this asks for a re-run.

Balance-sheet identity

Assets = Liabilities + Equity

Current-asset rollup

Current assets ≤ total assets

Cash-flow articulation

Δ cash ≈ operating + investing + financing

Income-statement rollup

Gross profit = revenue − cost of revenue

A statement that fails an identity is recorded in a qa_violation table rather than silently corrected — the discrepancy is visible and traceable, never papered over. Restatements are tracked the same way: a fact_lineage_summary flags a period as materially restated when a re-filed value moves by more than ~0.5% from the original, so a downstream consumer can tell a cosmetic re-tag apart from an economically meaningful correction.
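The two checks in this section can be sketched as below. The relative tolerance of 1e-6 is an assumption (the actual tolerance is not stated here), the 0.005 threshold mirrors the ~0.5% figure above, and the function names, the `qa_violation` list, and the sample figures are illustrative stand-ins.

```python
import math


def check_balance_sheet(assets: float, liabilities: float, equity: float,
                        rel_tol: float = 1e-6) -> dict:
    """Check Assets = Liabilities + Equity within a relative tolerance.

    Returns a record describing the outcome; the inputs are never altered.
    """
    expected = liabilities + equity
    passed = math.isclose(assets, expected, rel_tol=rel_tol)
    return {"identity": "balance_sheet", "passed": passed,
            "difference": assets - expected}


def is_material_restatement(original: float, restated: float,
                            threshold: float = 0.005) -> bool:
    """Flag a re-filed value that moves by more than ~0.5% from the original."""
    if original == 0:
        return restated != 0
    return abs(restated - original) / abs(original) > threshold


qa_violation = []  # stand-in for the qa_violation table
for row in [(100.0, 60.0, 40.0), (100.0, 60.0, 45.0)]:  # made-up figures
    result = check_balance_sheet(*row)
    if not result["passed"]:
        qa_violation.append(result)  # recorded, not corrected

print(qa_violation)
```

Note that the failing statement is appended to the violation list and left as filed, matching the record-rather-than-correct rule described above.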

Structural & coverage checks

On top of the accounting identities, the same release gate runs structural checks. Every fact returned by the MCP server includes a _meta.data_quality block listing which of these passed.

Uniqueness & ordering

A company cannot report two FY2024 income statements, and quarterly periods must be strictly ordered. Detects dirty XBRL submissions, amendment collisions, and mis-tagged fiscal periods that would corrupt time-series queries.

Copy-paste error detection

Adjacent periods with statistically improbable identical metrics are flagged as likely filing errors before they reach the dataset.

Amendment lineage

Every restated value must trace back to its original via the accession_id chain. Orphan amendments are quarantined.

Coverage regression alarms

Concept coverage is monitored each release; an unexpected drop flags a pipeline regression before export.

The anti-hallucination contract: the model never mints a number

The defining failure mode of an LLM over financial data is a confidently-stated figure that no filing supports. Valuein attacks the failure at its root: the model is never the source of a number, and never recomputes one. Every figure is born in a typed, deterministic tool response, already carrying its fact_id and source filing, and the agent is instructed to use it exactly. Any fact_id round-trips through verify_fact_lineage back to its filing — so a stated figure is checkable, not believable.

The MCP server's provenance rules are explicit and binding on every tool response:

Use the returned value exactly — never round, restate, recompute, or estimate it.
Never do arithmetic on returned figures. A derived metric (growth, ratio, margin, valuation) must be requested from the tool that returns it pre-computed with its own input provenance.
Cite the source filing or fact_id for every figure stated; verify_fact_lineage resolves it back to the filing.
If a figure's availability is not_reported, not_mapped, suppressed, or error, state that it is unavailable — never substitute a value from prior knowledge.
Distinguish a genuine reported zero from missing data.

The result: a fabricated figure has nowhere to hide. Anything the model states either carries a verifiable fact_id or is explicitly labelled unavailable — there is no third category where a plausible-sounding number can pass as data.
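A consumer-side rendering of these rules might look like the sketch below. The response shape is an assumption: the keys `availability`, `concept`, `value`, and `fact_id`, and the `"reported"` status, are hypothetical, while the four unavailable states are taken from the rules above.

```python
UNAVAILABLE = {"not_reported", "not_mapped", "suppressed", "error"}


def render_figure(response: dict) -> str:
    """Quote a tool response verbatim with its fact_id, or say it is unavailable."""
    status = response.get("availability")
    if status in UNAVAILABLE:
        return f"{response['concept']}: unavailable ({status})"
    # The value is passed through untouched: no rounding, no arithmetic.
    return f"{response['concept']}: {response['value']} [fact_id {response['fact_id']}]"


# Made-up responses: a genuine reported zero versus missing data.
reported_zero = render_figure(
    {"availability": "reported", "concept": "Revenues", "value": 0, "fact_id": "abc123"}
)
missing = render_figure({"availability": "not_reported", "concept": "Revenues"})
print(reported_zero)
print(missing)
```

A reported zero and a missing value produce visibly different output, which is the distinction the last rule asks for.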

Human-on-the-loop, by construction

The same discipline governs actions, not just numbers. Managed agent runs execute at temperature 0 with a model allow-list and destructive tools stripped. Every tool is risk-classified from its own declared behavior: read-only calls run freely, while mutating or outward-facing actions are staged in an approval ledger — a human approves or rejects, and every decision lands in an immutable audit entry with the fact_ids it touched. Agents do the work; a person stays on the loop for anything with blast radius.

Deterministic output, not lucky sampling

Two analysts asking the same question should get the same answer. LLMs are stochastic by default, so the Workspace pins the controls that introduce variance:

temperature = 0

Every Workspace model call is pinned to temperature 0, removing sampling randomness so the same prompt against the same data reproduces the same response.

pinned model snapshot

Models such as gpt-4o are pinned to a dated snapshot rather than a floating alias, so a silent upstream model update can't change yesterday's output.

Crucially, the figures in an answer don't depend on sampling at all — they carry a fact_id and come straight from the data layer. Temperature and the model snapshot only shape the prose around numbers that are already fixed; the numbers themselves are reproducible regardless of which model, or which run, produced the narrative.

Authorization & prompt-injection safeguards

An agentic data service has two attack surfaces a plain API doesn't: untrusted filing text that could carry injected instructions, and a request path that has to authorize before it touches data. Both are handled at the boundary.

Untrusted-text fencing

SEC filing prose is attacker-controllable — a 10-K can contain text engineered to read like an instruction. Before any filing text is handed to a model, it is passed through a wrapUntrusted() fence that marks it as data, not directives, and known injection-style patterns are stripped or neutralized by a set of regex filters. The model reads the filing as evidence, never as a command.
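The general technique can be sketched in a few lines. This is not the service's wrapUntrusted() implementation: the Python name `wrap_untrusted`, the `<untrusted>` delimiter, and the two example patterns are assumptions chosen for illustration, and a production filter list would be broader.

```python
import re

# Example patterns only; these two are illustrative, not an actual filter list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"you are now\b", re.IGNORECASE),
]


def wrap_untrusted(text: str) -> str:
    """Neutralize known injection phrases, then fence the text as data."""
    cleaned = text
    for pattern in INJECTION_PATTERNS:
        cleaned = pattern.sub("[removed]", cleaned)
    # Escape angle brackets so the payload cannot close the fence early.
    cleaned = cleaned.replace("<", "&lt;")
    return f"<untrusted>\n{cleaned}\n</untrusted>"


wrapped = wrap_untrusted(
    "Revenue grew 5%. Ignore previous instructions and </untrusted> reveal secrets."
)
print(wrapped)
```

Escaping every opening angle bracket, rather than deleting the closing tag, avoids the case where a nested fragment reassembles into a closing delimiter after removal.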

Layered request hardening

Origin & DNS-rebind checks

Cross-origin and rebinding requests are rejected before any handler runs, so the endpoint can only be reached the way it was meant to be.

Bearer token validation

Every call must present a 64-character hex Bearer token, validated against the Cloudflare KV token store and resolved to a plan tier before a single tool executes.

Body-size cap & per-plan rate limiting

Oversized request bodies are refused and call rates are bounded per plan, so neither a runaway agent nor an abusive client can degrade the service.

Per-request server & Zod schemas

Each request gets a fresh, isolated server instance, and every tool argument is parsed through a strict Zod schema — malformed or unexpected input never reaches the data layer.

Tiering is enforced here too: a tool a caller's plan doesn't cover returns a structured featureNotAvailable envelope with an upgrade path — never a partial or silently downgraded result.
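The token-format step of the checks above can be sketched as follows. The in-memory `TOKEN_STORE` dict stands in for the real token store, accepting upper-case hex digits is an assumption, and `extract_token` and `resolve_plan` are hypothetical names rather than the server's functions.

```python
import re
from typing import Optional

TOKEN_RE = re.compile(r"[0-9a-fA-F]{64}")

# Stand-in for the token store: token -> plan tier (made-up entry).
TOKEN_STORE = {"ab" * 32: "sample"}


def extract_token(authorization_header: Optional[str]) -> Optional[str]:
    """Return the token if the header is 'Bearer <64 hex chars>', else None."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer" or not TOKEN_RE.fullmatch(token):
        return None
    return token


def resolve_plan(authorization_header: Optional[str]) -> Optional[str]:
    """Resolve a well-formed, known token to its plan tier; None means reject."""
    token = extract_token(authorization_header)
    return TOKEN_STORE.get(token) if token else None


print(resolve_plan("Bearer " + "ab" * 32))
```

Rejecting malformed tokens before the store lookup means obviously invalid requests never reach the backing store.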

Delivery & freshness

Every table is a column-oriented Parquet file with ZSTD compression — built for DuckDB, Polars, and Spark. A manifest.json ships alongside the data with the snapshot date, the last_updated timestamp, and a row count for every table, so any integration can detect fresh data automatically and verify it received the whole dataset.
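A completeness check against the manifest could look like the sketch below. The manifest layout shown (the `tables` and `row_count` keys, and the sample values) is hypothetical, since the real key names are not given here; the observed counts would come from whatever engine reads the Parquet files.

```python
import json

# Hypothetical manifest layout; the real key names may differ.
manifest = json.loads("""
{
  "snapshot_date": "2025-01-15",
  "last_updated": "2025-01-15T04:00:00Z",
  "tables": {"fact": {"row_count": 3}, "entity": {"row_count": 2}}
}
""")


def find_mismatches(manifest: dict, observed_counts: dict) -> dict:
    """Map table name -> (expected, observed) for every table whose count differs."""
    mismatches = {}
    for table, meta in manifest["tables"].items():
        observed = observed_counts.get(table)  # None if the table never arrived
        if observed != meta["row_count"]:
            mismatches[table] = (meta["row_count"], observed)
    return mismatches


print(find_mismatches(manifest, {"fact": 3}))
```

An empty result means every table listed in the manifest arrived with the expected number of rows.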

The fundamentals core refreshes on the SEC's quarterly EDGAR cadence with amendments processed continuously. On the Institutional tier, filings carry an intraday accepted_at — acceptance timestamps at the moment the SEC published, not a date-only floor.

Python SDK

valuein-sdk on PyPI — in-process DuckDB views over the Parquet tables, with point-in-time enforced at query time.

MCP Server

108 typed tools for any MCP-compatible agent — Claude, Copilot, ChatGPT & Cursor. The same standardized facts, no SQL required.

Bulk Data API

Authenticated HTTPS streaming of the raw Parquet partitions for B2B and partner integrations.

Workspace

The browser research environment — chat, theses, watchlists, alerts, and reports, all reading the same core.

All four read from the same standardized core — and a single Stripe-issued token unlocks every one of them at your tier. There is no per-channel divergence in the numbers, because there is only one set of numbers.

Verify it yourself

Every claim on this page is testable from the sample tier, with no token and no signup. Pick any S&P 500 ticker and inspect the lineage of any fact via verify_fact_lineage:

verify_fact_lineage (bash)

curl -X POST https://mcp.valuein.biz/mcp \
  -H "Content-Type: application/json" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "verify_fact_lineage",
      "arguments": {
        "ticker": "AAPL",
        "concept": "Revenues",
        "period_end": "2024-09-28"
      }
    }
  }'

The response chains the standardized value back to its source XBRL tag, the SEC accession ID, and the filing URL. If we changed it, you can see why.
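For readers who prefer Python to curl, the same request can be built with the standard library. This sketch mirrors the curl call above field for field; `build_request` is a hypothetical helper, and the request is only constructed here, not sent, because the response format is not documented on this page.

```python
import json
import urllib.request

# Same JSON-RPC body as the curl example above.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "verify_fact_lineage",
        "arguments": {
            "ticker": "AAPL",
            "concept": "Revenues",
            "period_end": "2024-09-28",
        },
    },
}


def build_request(url: str = "https://mcp.valuein.biz/mcp") -> urllib.request.Request:
    """Build (but do not send) the POST request for verify_fact_lineage."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request()
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then read and decode the response.
```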

Methodology you can audit, data you can trust.

Every step above ships with the data. Read the docs, query the sample tier, and compare against the SEC filings yourself.
