Methodology

How we build trustworthy financial data

111M+ SEC facts across 19,000+ entities: point-in-time accurate, survivorship-free, and each auditable back to its source filing. Here is exactly how each guarantee is built.

Bloomberg, WRDS, and Compustat make claims about point-in-time accuracy and survivorship-bias-free coverage. We document exactly how ours work — from coverage and survivorship to concept standardization, amendment handling, accepted_at semantics, the smart-money dataset, and the validation checks that run on every release. Then we show you how to audit any of it yourself.

Why XBRL is hard

The SEC has required XBRL submissions since 2009. The format is machine-readable, but standardization stops there. Each filer picks their own taxonomy: us-gaap, ifrs-full, or a custom extension. A single concept like “revenue” resolves to a dozen possible XBRL tags depending on the company, the year, and whether ASC 606 had been adopted.

Restatements are not corrections — they are new filings with the same fiscal period but different values. Quarterly cash flow statements report year-to-date totals, not quarter-only figures. Foreign filers use 20-F and 40-F instead of 10-K with subtly different concept names. And the companies that went bankrupt stopped filing — so a naïve dataset quietly forgets they ever existed.

Anyone can parse XBRL. The work is producing a dataset where SELECT revenue FROM fact WHERE ticker = 'AAPL' returns consistently defined values across 30 years of filings, and where a backtest sees the world exactly as it looked on its as-of date.

Coverage & scope

The dataset spans the full SEC EDGAR universe of XBRL filers from 1993–present: every active company and every company that has since delisted, gone bankrupt, or been acquired. It is organized as two datasets — a 111M+-fact fundamentals core and a 78M+-row smart-money dataset — across 17 Parquet tables.

Financial facts: 111M+
Entities (active + delisted): 19,000+
History: 1993–present
Smart-money rows: 78M+
Standardized concepts: 292
Parquet tables: 17

Filing forms covered

10-K: Annual report
10-Q: Quarterly report
8-K: Material events
20-F: Foreign annual report
40-F: Canadian annual report
/A: Amendments to any of the above

The fundamentals core is sourced from the SEC's quarterly EDGAR Financial Statements Data Sets plus the per-filing XBRL submissions — the same primary source the Commission publishes. Nothing is scraped from third-party aggregators. See the full dataset page for the pipeline and per-table schema.

Survivorship-free by construction

Survivorship bias is the single most expensive hidden flaw in a backtest. Most datasets only keep the companies that are still trading today — so your strategy is silently tested against a universe that already knows which companies survived. The Enrons, the Lehmans, the RadioShacks vanish, and historical returns inflate by a percentage point or two that evaporates the moment you trade live.

Valuein retains every entity that ever filed XBRL financial statements — delisted, bankrupt, acquired, merged — with its complete filing history through its final SEC filing. Roughly half of the 19,000+ entities in the universe are no longer actively trading. They stay in the dataset; they are simply not present on dates after they stopped filing, which is exactly how they would have appeared in real time.

A few of the failures still in the data

Enron Corp
Lehman Brothers
RadioShack
Toys R Us
Blockbuster
WorldCom
Bear Stearns
Washington Mutual
Sears Holdings

Each with complete financial statements through its final filing — so a strategy that would have bought them is held accountable for what happened next.

Point-in-time and survivorship-free are two of the trust guarantees we make. See the trust & security overview for provenance, zero-retention, and reliability — this page covers the data construction underneath them.

Concept standardization

We map 11,966 raw XBRL tags to 292 canonical concepts. Definitions are versioned in a taxonomy_guide table that ships with every Parquet bucket — so you can audit every transformation we apply, and unmapped tags fall through to a labelled Other rather than being silently dropped.

Worked example: Revenue

Source XBRL tag | Used by | Note
us-gaap:Revenues | Apple, Microsoft | Most common
us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax | Tesla, Walmart | Post-ASC 606 adoption
us-gaap:SalesRevenueNet | Pre-2018 filers | Legacy tag, deprecated
us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax | Some retailers | Includes sales tax pass-through
msft:Revenues | Microsoft (custom extension) | Custom XBRL extension

All five resolve to standard_concept = 'TotalRevenue'. Every fact also keeps its raw concept column — the exact XBRL tag the company filed — so you can always trace a standardized value back to source. This 5-row illustration is the level of detail we publish; the canonical name and definition of every concept lives in the data catalog.
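The mapping described above can be sketched in a few lines of Python. This is an illustration only: the dictionary is a hypothetical five-row excerpt built from the table above, not the shipped taxonomy_guide, and the function name standardize is an assumption made for the example.

```python
# Hypothetical excerpt of a tag-to-concept mapping. The real definitions live
# in the taxonomy_guide table and cover far more tags than shown here.
TAG_TO_CONCEPT = {
    "us-gaap:Revenues": "TotalRevenue",
    "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax": "TotalRevenue",
    "us-gaap:SalesRevenueNet": "TotalRevenue",
    "us-gaap:RevenueFromContractWithCustomerIncludingAssessedTax": "TotalRevenue",
    "msft:Revenues": "TotalRevenue",
}


def standardize(fact: dict) -> dict:
    """Return a copy of the fact with standard_concept added.

    The raw `concept` tag is kept untouched, and unmapped tags fall
    through to a labelled "Other" instead of being dropped.
    """
    standard = TAG_TO_CONCEPT.get(fact["concept"], "Other")
    return {**fact, "standard_concept": standard}


facts = [
    {"ticker": "MSFT", "concept": "msft:Revenues"},
    {"ticker": "TSLA", "concept": "us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax"},
    {"ticker": "XYZ", "concept": "xyz:SomeUnmappedTag"},  # hypothetical unmapped tag
]
standardized = [standardize(f) for f in facts]
```

Keeping the raw tag next to the canonical name is what makes the transformation auditable: any standardized row can be compared against the tag the company actually filed.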

Point-in-time, not point-in-hindsight

The most common look-ahead bias in financial data isn't malicious — it's using the wrong date column. Three timestamps live on every fact, and they mean different things.

report_date: When the period ended

e.g. 2024-09-28 (Apple FY2024)

Aligns financials to a fiscal calendar. Never use as a PIT cutoff: companies file weeks or months later.

filing_date: When the filing was submitted

e.g. 2024-11-01

Useful for filing-cadence analysis. Still not PIT-safe: filings can be accepted hours after the date stamp.

accepted_at: When SEC accepted it (the canonical PIT field)

e.g. 2024-11-01T06:01:36Z

The exact moment the data became public. Use this, and only this, for backtests and any look-ahead-free analysis.

Every PIT-safe MCP tool and SDK method accepts an as_of_date parameter. Internally, that filters on accepted_at <= as_of_date — the queryable equivalent of “what did the market know on this date?”
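As a rough sketch of what that filter means, the snippet below applies accepted_at <= as_of_date to plain Python dictionaries. It is not the SDK's implementation. The helper names are invented for the example, and treating a date-only cutoff as midnight UTC at the start of that day is an assumption, not documented behaviour.

```python
from datetime import datetime, timezone


def parse_utc(value: str) -> datetime:
    """Parse an ISO-8601 string; assume UTC when no offset is given."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def as_of(facts: list, as_of_date: str) -> list:
    """Keep only facts that were public at the cutoff (accepted_at <= cutoff)."""
    cutoff = parse_utc(as_of_date)
    return [f for f in facts if parse_utc(f["accepted_at"]) <= cutoff]


# Timestamps taken from the Apple FY2024 example above.
facts = [
    {"ticker": "AAPL", "report_date": "2024-09-28", "accepted_at": "2024-11-01T06:01:36Z"},
]

before_filing = as_of(facts, "2024-10-15")           # period ended, not yet public
same_day_midnight = as_of(facts, "2024-11-01")       # assumed cutoff: 00:00 UTC
at_acceptance = as_of(facts, "2024-11-01T06:01:36Z")  # public from this instant
```

Filtering on report_date instead would have exposed the FY2024 figures on 2024-09-28, more than a month before anyone outside the company could see them.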

Amendments and restatements

When a company files a 10-K/A, the SEC treats it as a new filing — not an overwrite. Most data vendors collapse the amendment over the original, destroying the historical view. We keep both.

Two rows, one fiscal period (text)
-- Apple FY2018 net income, original filing
ticker     fiscal_year  standard_concept  numeric_value  accepted_at
AAPL       2018         NetIncome         59531000000    2018-11-05T18:23:00Z

-- Apple FY2018 net income, after restatement (hypothetical)
ticker     fiscal_year  standard_concept  numeric_value  accepted_at
AAPL       2018         NetIncome         59300000000    2019-02-12T14:51:00Z

A backtest that ran on 2018-12-01 sees the first row only — the original $59.531B. A current dashboard sees the latest accepted value. Both are correct; both are queryable. The PIT discipline is what guarantees you get the right one.
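One way to picture "both are queryable" is a selection that keeps, for each fiscal period and concept, the most recently accepted row that was already public at the cutoff. The sketch below uses the two illustrative rows above (the second is marked hypothetical in the example). The function name latest_as_of is an assumption, not an SDK method.

```python
from datetime import datetime

ROWS = [
    {"ticker": "AAPL", "fiscal_year": 2018, "standard_concept": "NetIncome",
     "numeric_value": 59_531_000_000, "accepted_at": "2018-11-05T18:23:00Z"},
    # Hypothetical restatement, as in the example above.
    {"ticker": "AAPL", "fiscal_year": 2018, "standard_concept": "NetIncome",
     "numeric_value": 59_300_000_000, "accepted_at": "2019-02-12T14:51:00Z"},
]


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp with a trailing Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_as_of(rows: list, cutoff_iso: str) -> list:
    """For each (ticker, fiscal_year, concept), keep the newest row accepted by the cutoff."""
    cutoff = parse_ts(cutoff_iso)
    best = {}
    for row in rows:
        accepted = parse_ts(row["accepted_at"])
        if accepted > cutoff:
            continue  # not yet public at the cutoff
        key = (row["ticker"], row["fiscal_year"], row["standard_concept"])
        if key not in best or accepted > parse_ts(best[key]["accepted_at"]):
            best[key] = row
    return list(best.values())


backtest_view = latest_as_of(ROWS, "2018-12-01T00:00:00Z")  # original value only
current_view = latest_as_of(ROWS, "2024-01-01T00:00:00Z")   # restated value
```

Because neither row is overwritten, the same function answers both questions. Only the cutoff changes.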

Quarterly cash flow derivation

In Q2 and Q3 10-Q filings, US GAAP requires cash flow statements to report year-to-date totals. Computing a clean quarterly time series requires subtracting the prior period's year-to-date total (and, for Q4, subtracting the Q3 year-to-date total from the annual figure), every time, for every issuer, for every line item.

Example: operating cash flow (text)
Period        numeric_value (YTD)   derived_quarterly_value
Q1 2024       12.0B                 12.0B
Q2 2024       28.0B                 16.0B   ← 28.0 − 12.0
Q3 2024       45.0B                 17.0B   ← 45.0 − 28.0
Q4 2024       62.0B                 17.0B   ← 62.0 − 45.0

Both columns ship in every Parquet bucket. Use COALESCE(derived_quarterly_value, numeric_value) when you want a true quarterly time series; use numeric_value when you specifically want the as-reported YTD figure.
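The subtraction in the table above can be written out as a short Python sketch. It assumes the input is one fiscal year of cumulative values in Q1 to Q4 order with no missing quarters. The function name is illustrative and is not part of the SDK.

```python
def derive_quarterly(ytd_values: list) -> list:
    """Turn cumulative year-to-date values (Q1..Q4 order) into quarter-only values."""
    quarterly = []
    previous_ytd = 0.0
    for ytd in ytd_values:
        quarterly.append(ytd - previous_ytd)  # this quarter = YTD minus prior YTD
        previous_ytd = ytd
    return quarterly


# Operating cash flow in billions, from the example table.
ytd = [12.0, 28.0, 45.0, 62.0]
quarterly = derive_quarterly(ytd)  # [12.0, 16.0, 17.0, 17.0]
```

A quick sanity check on any derived series is that the four quarter-only values sum back to the full-year figure, here 62.0.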

Point-in-time index membership

“What was in the S&P 500 on March 1, 2014?” is a survivorship trap in disguise: screen the index by its current members and you have quietly excluded everyone who was dropped. Index membership is therefore tracked the same way facts are — historically, with effective and removal dates.

The index_membership table records membership spells for S&P 500 and Russell 1000 / 2000 / 3000 with an effective_date and a removal_date per spell, using half-open [effective, removal) interval semantics. A company that left and rejoined gets two spells, not a merged one — so a point-in-time universe on any date reconstructs the index exactly as it stood.

Members of an index on a given date (sql)
SELECT r.symbol, r.name
FROM index_membership im
JOIN "references" r ON r.cik = im.cik  -- quoted: REFERENCES is a reserved word in standard SQL
WHERE im.index_name = 'SP500'
  AND '2014-03-01' >= im.effective_date
  AND ('2014-03-01' < im.removal_date OR im.removal_date IS NULL);

There is no is_sp500 flag — a single boolean can only describe one index at one moment, which is precisely the snapshot bias we avoid. Membership is always a JOIN on cik.
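The half-open [effective, removal) rule is easy to get wrong at the boundaries, so here is a small Python sketch of the same predicate. The CIKs and dates are invented for illustration and do not describe real index changes.

```python
from datetime import date
from typing import Optional


def is_member(on: date, effective: date, removal: Optional[date]) -> bool:
    """Half-open interval: a member from effective_date up to, not including, removal_date."""
    return effective <= on and (removal is None or on < removal)


# Hypothetical spells as (cik, effective_date, removal_date).
# CIK 1001 left and later rejoined, so it has two separate spells.
SPELLS = [
    (1001, date(2010, 1, 4), date(2013, 6, 3)),
    (1001, date(2015, 3, 23), None),
    (1002, date(2012, 9, 5), date(2014, 3, 1)),
]


def members_on(spells: list, on: date) -> list:
    """Return the sorted CIKs that were index members on the given date."""
    return sorted({cik for cik, eff, rem in spells if is_member(on, eff, rem)})


day_before_removal = members_on(SPELLS, date(2014, 2, 28))
on_removal_date = members_on(SPELLS, date(2014, 3, 1))
after_rejoin = members_on(SPELLS, date(2016, 1, 1))
```

On the removal date itself the company is already out, and during the gap between its two spells CIK 1001 is correctly absent. A merged spell would have reported it as a member throughout.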

The smart-money dataset

The second dataset — 78M+ rows across six tables — standardizes who is buying and who is holding. It is built from the SEC's mandatory ownership disclosures and held to the same point-in-time and survivorship guarantees as the fundamentals.

Insider activity
Forms: Form 3, Form 4, Form 5, Form 144

Every officer, director, and 10%+ owner transaction (buys, sells, option exercises, and proposed sales), standardized one row per transaction with the transaction code, shares, price, and post-transaction holdings.

Tables: insider_transaction, insider_filing, insider_party

Beneficial ownership
Forms: SC 13D, SC 13G

5%+ activist and passive stakes, one row per reporting person, with the percent owned and the full voting / dispositive-power breakdown.

Tables: insider_ownership

Institutional 13F
Forms: 13F-HR, 13F-NT

Quarterly position disclosures for every institutional manager (shares, USD market value, put/call, and voting authority), one row per holding, resolvable to the issuer.

Tables: institutional_holding, institutional_filing

Reporting persons are resolved into a deduplicated directory and 13F holdings are linked back to the issuer they describe, so each row carries a soft reference to entity.cik. The references are soft, not hard, foreign keys — a foreign, pre-IPO, or delisted issuer that doesn't resolve is kept rather than dropped, so coverage is never silently lost. Each disclosure is point-in-time via its own accepted_at. Full table-by-table detail is on the smart-money dataset page.

Foreign private issuers

Foreign private issuers don't file 10-Ks. They file 20-F (and Canadian issuers file 40-F), often under IFRS rather than US GAAP, with their own concept names. Those filings flow through the same standardization pipeline: concepts map into the same canonical standard_concept vocabulary, and an is_foreign flag on the entity lets you include or isolate them. The result is that a US filer and a foreign issuer answer the same query the same way.

Validation checks on every release

Every fact returned by the MCP server includes a _meta.data_quality block listing which checks passed. The set runs on every Parquet build before we publish.

Uniqueness & ordering

A company cannot report two FY2024 income statements, and quarterly periods must be strictly ordered. Detects dirty XBRL submissions, amendment collisions, and mis-tagged fiscal periods that would corrupt time-series queries.

Copy-paste error detection

Adjacent periods with statistically improbable identical metrics are flagged as likely filing errors before they reach the dataset.

Amendment lineage

Every restated value must trace back to its original via the accession_id chain. Orphan amendments are quarantined.

Coverage regression alarms

Concept coverage is monitored each release; an unexpected drop flags a pipeline regression before export.
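To give a feel for one of these checks, the sketch below flags a period whose metrics match the preceding period too closely to be plausible. It is a toy version under stated assumptions: the 0.8 threshold, the three-metric minimum, and the function names are invented for the example and are not the production rules.

```python
def identical_share(current: dict, previous: dict) -> float:
    """Fraction of shared, non-zero metrics whose values match exactly."""
    shared = [k for k in current if k in previous and current[k] != 0]
    if not shared:
        return 0.0
    same = sum(1 for k in shared if current[k] == previous[k])
    return same / len(shared)


def flag_copy_paste(periods: list, threshold: float = 0.8, min_metrics: int = 3) -> list:
    """periods: chronological list of (label, metrics). Returns labels that look copied."""
    flagged = []
    for (_, prev), (label, cur) in zip(periods, periods[1:]):
        shared = [k for k in cur if k in prev and cur[k] != 0]
        if len(shared) >= min_metrics and identical_share(cur, prev) >= threshold:
            flagged.append(label)
    return flagged


# Hypothetical values: Q2 repeats Q1 exactly, Q3 differs on every metric.
periods = [
    ("Q1", {"TotalRevenue": 100, "NetIncome": 10, "OperatingCashFlow": 15}),
    ("Q2", {"TotalRevenue": 100, "NetIncome": 10, "OperatingCashFlow": 15}),
    ("Q3", {"TotalRevenue": 120, "NetIncome": 12, "OperatingCashFlow": 18}),
]
flagged = flag_copy_paste(periods)
```

Requiring several shared metrics before flagging keeps the check from firing on a single coincidentally unchanged line item.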

Delivery & freshness

Every table is a column-oriented Parquet file with ZSTD compression — built for DuckDB, Polars, and Spark. A manifest.json ships alongside the data with the snapshot date, the last_updated timestamp, and a row count for every table, so any integration can detect fresh data automatically and verify it received the whole dataset.
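An integration might use the manifest roughly as follows. The JSON layout here is a guess made for illustration: the text above only says the manifest carries a snapshot date, a last_updated timestamp, and a row count per table, so the key names (snapshot_date, tables, row_count), the function name, and all the numbers are assumptions.

```python
import json

# Hypothetical manifest layout and values; the real key names may differ.
MANIFEST_JSON = """
{
  "snapshot_date": "2025-01-15",
  "last_updated": "2025-01-15T04:00:00Z",
  "tables": {
    "fact": {"row_count": 1000},
    "index_membership": {"row_count": 50}
  }
}
"""


def check_manifest(manifest: dict, observed_counts: dict, last_seen_update: str):
    """Return (is_fresh, mismatched_tables) for a downloaded snapshot.

    Timestamps are compared as strings, which is only valid when both
    use the same ISO-8601 format and time zone.
    """
    expected = {name: meta["row_count"] for name, meta in manifest["tables"].items()}
    mismatched = sorted(
        name for name, count in expected.items() if observed_counts.get(name) != count
    )
    is_fresh = manifest["last_updated"] > last_seen_update
    return is_fresh, mismatched


manifest = json.loads(MANIFEST_JSON)
observed = {"fact": 1000, "index_membership": 49}  # one table arrived short
is_fresh, mismatched = check_manifest(manifest, observed, "2024-10-15T04:00:00Z")
```

A non-empty mismatch list means the download was incomplete and should be retried before anything is queried.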

The fundamentals core refreshes on the SEC's quarterly EDGAR cadence with amendments processed continuously. On the Institutional tier, filings carry an intraday accepted_at — acceptance timestamps at the moment the SEC published, not a date-only floor.

Python SDK

valuein-sdk on PyPI — in-process DuckDB views over the Parquet tables, with point-in-time enforced at query time.

MCP Server

57 typed tools for any MCP-compatible agent (Claude, Cursor, Codex). The same standardized facts, no SQL required.

Bulk Data API

Authenticated HTTPS streaming of the raw Parquet partitions for B2B and partner integrations.

Workspace

The browser research environment — chat, theses, watchlists, alerts, and reports, all reading the same core.

All four read from the same standardized core — and a single Stripe-issued token unlocks every one of them at your tier. There is no per-channel divergence in the numbers, because there is only one set of numbers.

Verify it yourself

Every claim on this page is testable from the sample tier, with no token and no signup. Pick any S&P 500 ticker and inspect the lineage of any fact via verify_fact_lineage:

verify_fact_lineage (bash)
curl -X POST https://mcp.valuein.biz/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
      "name": "verify_fact_lineage",
      "arguments": {
        "ticker": "AAPL",
        "concept": "TotalRevenue",
        "period_end": "2024-12-31"
      }
    }
  }'

The response chains the standardized value back to its source XBRL tag, the SEC accession ID, and the filing URL. If we changed it, you can see why.

Methodology you can audit, data you can trust.

Every step above ships with the data. Read the docs, query the sample tier, and compare against the SEC filings yourself.