Parquet Schema Reference
Fifteen ZSTD-compressed Parquet tables — a 9-table fundamentals dataset plus the 6-table smart-money dataset (Institutional). 111M+ facts, ~78M smart-money rows, 19,000+ entities, 1993–present. All tables use consistent join keys — cik for companies, accession_id for filings.
Entity-Relationship Diagram
All tables link via two universal keys: cik (company) and accession_id (filing). The references table is a derived flat join — use it as your query starting point to avoid manual joins.
┌─────────────┐ ┌────────────────┐ ┌──────────────────────┐
│ entity │ │ security │ │ index_membership │
│─────────────│ │────────────────│ │──────────────────────│
│ cik (PK) │──┐ │ id (PK) │ ┌──│ cik (FK → entity) │
│ name │ └──│ entity_id (FK) │ │ │ index_name │
│ sector │ │ symbol │ │ │ effective_date │
│ sic_code │ │ exchange │ │ │ removal_date │
│ status │ │ is_active │ │ │ removal_reason │
└─────────────┘ └────────────────┘ │ │ successor_cik │
│ │ └──────────────────────┘
│ ┌─────────────────┐ │ ▲
└────────────│ filing │ │ │
│─────────────────│ │ │
│ accession_id(PK)│ │ │
│ entity_id (FK) │ │ │
│ form_type │ │ │
│ filing_date │ │ │
│ accepted_at │ │ │
└─────────────────┘ │ │
│ │ │
└────────────┌──────────────────────┐
│ fact │
│──────────────────────│
│ fact_id (deterministic)│
│ entity_id (FK) │
│ accession_id (FK) │
│ standard_concept │
│ numeric_value │
│ value_current │
│ value_as_filed │
│ accepted_at (PIT) │
└──────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ references (derived — start every cross-co query here) │
│──────────────────────────────────────────────────────────────│
│ cik, name, sector, industry, sic_code, status, entity_type │
│ security_id, symbol, exchange, mic, is_active, valid_* │
│ figi, composite_figi, share_class_figi │
│ │
│ Index membership? JOIN index_membership ON cik = cik │
│ (same column name both sides — migration 0015). │
└──────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Derived analytic table (recomputed each pipeline run): │
│ ratio │
│ Keys on entity_id (= references.cik) + period_end. │
└──────────────────────────────────────────────────────────────┘
╔══════════════════════════════════════════════════════════════╗
║ SMART-MONEY DATASET (Institutional tier — ~78M rows) ║
╠══════════════════════════════════════════════════════════════╣
║ insider_party ─┬─ insider_filing ─┬─ insider_transaction ║
║ (who) │ (the disclosure)└─ insider_ownership (13D/G)║
║ │ ║
║ institutional_filing ── institutional_holding (13F-HR) ║
║ ║
║ Issuer link (soft): insider_transaction.entity_id, ║
║ *_holding/_ownership.subject_entity_id → entity.cik ║
╚══════════════════════════════════════════════════════════════╝Tables
entity
cikOne row per SEC-registered company. Contains company profile data: name, sector, SIC code, location, CEO, description, and status. The root table — all other tables link back via entity_id = entity.cik.
security
idExchange listings with SCD Type 2 history. Multiple rows per company — one per ticker/exchange combination. FIGI identifiers included for cross-system mapping.
filing
accession_idSEC EDGAR financial-statement filing index — 10-K, 10-Q, 8-K, 20-F, 40-F and amendments. One row per accepted filing. The bridge between companies and their financial facts. Every fact row carries this filing's accepted_at forward as its own PIT anchor.
fact
entity_id + accession_id + standard_conceptThe core financial data table. 111M+ XBRL-sourced data points normalized to 292 standard_concept labels. Each row is a single financial value for a specific company, period, and concept.
valuation
idPipeline-computed DCF and DDM intrinsic values. One row per company per valuation run. Includes WACC, growth rates, and margin of safety vs. market price.
taxonomy_guide
standard_conceptLookup table for standard_concept labels used in the fact table. Maps canonical names to human-readable labels, definitions, and GAAP taxonomy references.
index_membership
idHistorical index constituent records — SP500, RUSSELL1000, RUSSELL2000, RUSSELL3000. One row per (cik, index, membership-period). Essential for point-in-time universe construction — prevents survivorship bias.
references
symbol + cikDerived flat join of entity + security. One row per security. Eliminates the need for 2-table joins for company-level metadata. Always start here for cross-company queries.
ratio
entity_id + ratio_name + period_end + fiscal_period + accepted_atPipeline-computed financial ratios per entity per fiscal period. Append-on-restatement: accepted_at is the PIT vintage (max accepted_at of the input facts), so a restated period is a new vintage row — filter accepted_at <= as_of then take the latest vintage to avoid look-ahead bias, mirroring the fact table.
institutional_holding
idSmart-money · Form 13F-HR holdings — institutional managers' quarterly positions. One row per (filing, CUSIP, share-class, put/call): shares, USD market value, options, and voting authority. Institutional tier only.
insider_transaction
idSmart-money · Form 3/4/5/144 line items — one row per insider transaction, initial holding, or proposed sale. The core insider table: transaction code, shares, price, post-transaction holdings, role, and direct/indirect ownership. Institutional tier only.
insider_filing
idSmart-money · subject-rooted insider disclosure filing records (Form 3/4/5/SC 13D/SC 13G/144 + amendments) — kept separate from the financial filing table. Parent of insider_transaction and insider_ownership via accession_id. Institutional tier only.
insider_ownership
idSmart-money · SC 13D / SC 13G beneficial-ownership disclosures (5%+ stakes). One row per reporting person; group filings emit multiple rows. Carries percent owned and the full voting/dispositive power breakdown. Institutional tier only.
insider_party
idSmart-money · deduplicated directory of every insider / reporting person referenced across Form 3/4/5/144. Keyed on CIK when SEC-registered, otherwise on a normalized name. Institutional tier only.
institutional_filing
idSmart-money · manager-rooted 13F-HR / 13F-NT filing records (and amendments) — distinct from the issuer-rooted filing table because a 13F has no single subject issuer. FK to insider_party for the manager. Institutional tier only.
Key Columns Guide
The most important columns across tables. Master these and you can write any query.
| Column | Table | Type | Nullable | Description |
|---|---|---|---|---|
accepted_atPIT | fact | TIMESTAMPTZ | No | PIT timestamp — when this fact became publicly known via SEC EDGAR acceptance. Use accepted_at <= trade_date for look-ahead-free backtests. |
standard_concept | fact | VARCHAR | Yes | Canonical normalized concept name (e.g. Revenues, NetIncomeLoss). Use this — not raw concept — for cross-company queries. |
numeric_value | fact | FLOAT | Yes | Parsed financial value. NULL for non-numeric XBRL facts. Always use ABS() on capex; divide by NULLIF(denom, 0) on ratios. |
derived_quarterly_value | fact | FLOAT | Yes | Q2/Q3 10-Q cash flows adjusted from YTD to single-quarter. Use COALESCE(derived_quarterly_value, numeric_value) for all cash flow metrics. |
cik | index_membership / references | VARCHAR | No | Index membership join key. Same column name on both tables (since migration 0015) — JOIN index_membership ON references.cik = index_membership.cik. Filter index_name = 'SP500' AND removal_date IS NULL for the current S&P500. |
is_active | references / security | BOOLEAN | No | TRUE if this ticker is currently active (valid_to IS NULL). Filter to avoid duplicate rows from ticker changes. |
accession_id | filing | VARCHAR | No | SEC EDGAR accession number — the unique identifier for a filing. Format: XXXXXXXXXX-YY-ZZZZZZ. The join key between filing and fact. |
cik | entity / references | VARCHAR | No | SEC Central Index Key — 10-digit company identifier. The universal join key across all tables. entity.cik = security.entity_id = filing.entity_id = fact.entity_id. |
All Fields by Table
Quick-reference field list. For full descriptions and SEC XBRL tag mappings, use the Data Catalog.
DuckDB Usage
All Parquet files load natively with DuckDB's read_parquet(). DuckDB handles ZSTD decompression automatically and supports lazy projection — only the columns you SELECT are read from disk.
-- Load a table directly from disk
SELECT *
FROM read_parquet('fact.parquet')
WHERE standard_concept = 'Revenues'
AND accepted_at <= '2023-01-01'
LIMIT 10;
-- Query directly from the API URL (presigned or streamed).
-- Membership goes through index_membership — references no longer
-- carries an is_sp500 flag (dropped 2026-05-02).
SELECT r.symbol, fa.numeric_value / 1e9 AS revenue_billions
FROM read_parquet('https://data.valuein.biz/v1/sp500/references') r
JOIN read_parquet('https://data.valuein.biz/v1/sp500/index_membership') im
ON im.cik = r.cik
JOIN read_parquet('https://data.valuein.biz/v1/sp500/fact') fa
ON fa.entity_id = r.cik
WHERE im.index_name = 'SP500'
AND im.removal_date IS NULL
AND fa.standard_concept = 'Revenues'
AND fa.accepted_at <= '2024-01-01'
ORDER BY revenue_billions DESC
LIMIT 20;ZSTD compression
All files are ZSTD-compressed. Typical compression ratio is 6–10x vs uncompressed. DuckDB decompresses transparently.
Column pruning
SELECT only the columns you need — DuckDB reads only those column chunks from the Parquet file, not the full table.
Predicate pushdown
DuckDB pushes WHERE clauses into Parquet row group statistics, skipping groups that cannot contain matching rows.
Explore the full field catalog
Search and filter all fields with SEC XBRL tag mappings, types, and example values in the interactive Data Catalog.