Parquet Schema Reference
Eight ZSTD-compressed Parquet tables. 105M+ facts, 12,000+ tickers, 1994–present. All tables use consistent join keys — cik for companies, accession_id for filings.
Entity-Relationship Diagram
All tables link via two universal keys: cik (company) and accession_id (filing). The references table is a derived flat join — use it as your query starting point to avoid manual joins.
┌─────────────┐ ┌────────────────┐ ┌──────────────────┐
│ entity │ │ security │ │ index_membership │
│─────────────│ │────────────────│ │──────────────────│
│ cik (PK) │──┐ │ id (PK) │──┐ │ id (PK) │
│ name │ └──│ entity_id (FK) │ └──│ security_id (FK) │
│ sector │ │ symbol │ │ index_name │
│ sic_code │ │ exchange │ │ start_date │
│ status │ │ is_active │ │ end_date │
└─────────────┘ └────────────────┘ └──────────────────┘
│
│ ┌─────────────────┐
└────────────│ filing │
│─────────────────│
│ accession_id(PK)│
│ entity_id (FK) │
│ form_type │
│ filing_date │
│ accepted_at │
└─────────────────┘
│
└────────────┌──────────────────┐
│ fact │
│──────────────────│
│ entity_id (FK) │
│ accession_id(FK) │
│ standard_concept │
│ numeric_value │
│ knowledge_at │
└──────────────────┘
┌────────────────────────────────────────────────────────┐
│ references (derived — start every cross-co query here)│
│────────────────────────────────────────────────────────│
│ cik, name, sector, industry, sic_code, status │
│ symbol, exchange, is_active │
│ is_sp500 ← replaces 3-table join │
└────────────────────────────────────────────────────────┘Tables
entity
cikOne row per SEC-registered company. Contains company profile data: name, sector, SIC code, location, CEO, description, and status. The root table — all other tables link back via entity_id = entity.cik.
security
idExchange listings with SCD Type 2 history. Multiple rows per company — one per ticker/exchange combination. FIGI identifiers included for cross-system mapping.
filing
accession_idSEC EDGAR filing index. One row per accepted filing. The bridge between companies and their financial facts. accepted_at is the source for fact.knowledge_at.
fact
entity_id + accession_id + standard_conceptThe core financial data table. 105M+ XBRL-sourced data points normalized to ~150 standard_concept labels. Each row is a single financial value for a specific company, period, and concept.
valuation
idPipeline-computed DCF and DDM intrinsic values. One row per company per valuation run. Includes WACC, growth rates, and margin of safety vs. market price.
taxonomy_guide
standard_conceptLookup table for standard_concept labels used in the fact table. Maps canonical names to human-readable labels, definitions, and GAAP taxonomy references.
index_membership
idHistorical index constituent records. One row per security/index/membership-period. Essential for point-in-time universe construction — prevents survivorship bias.
references
symbol + cikDerived flat join of entity + security + index_membership. One row per security. Eliminates the need for 3-table joins in most queries. Always start here.
Key Columns Guide
The most important columns across tables. Master these and you can write any query.
| Column | Table | Type | Nullable | Description |
|---|---|---|---|---|
knowledge_atPIT | fact | TIMESTAMPTZ | No | PIT timestamp — when this fact became publicly known via SEC EDGAR acceptance. Use knowledge_at <= trade_date for look-ahead-free backtests. |
standard_concept | fact | VARCHAR | Yes | Canonical normalized concept name (e.g. Revenues, NetIncomeLoss). Use this — not raw concept — for cross-company queries. |
numeric_value | fact | FLOAT | Yes | Parsed financial value. NULL for non-numeric XBRL facts. Always use ABS() on capex; divide by NULLIF(denom, 0) on ratios. |
derived_quarterly_value | fact | FLOAT | Yes | Q2/Q3 10-Q cash flows adjusted from YTD to single-quarter. Use COALESCE(derived_quarterly_value, numeric_value) for all cash flow metrics. |
is_sp500 | references | BOOLEAN | No | TRUE if this security is currently a member of the S&P 500. Fastest filter for S&P 500 screens — no join to index_membership needed. |
is_active | references / security | BOOLEAN | No | TRUE if this ticker is currently active (valid_to IS NULL). Filter to avoid duplicate rows from ticker changes. |
accession_id | filing | VARCHAR | No | SEC EDGAR accession number — the unique identifier for a filing. Format: XXXXXXXXXX-YY-ZZZZZZ. The join key between filing and fact. |
cik | entity / references | VARCHAR | No | SEC Central Index Key — 10-digit company identifier. The universal join key across all tables. entity.cik = security.entity_id = filing.entity_id = fact.entity_id. |
All Fields by Table
Quick-reference field list. For full descriptions and SEC XBRL tag mappings, use the Data Catalog.
entity
23 fields| Field | Type | Description |
|---|---|---|
cik | VARCHAR | SEC Central Index Key — 10-digit unique company identifier |
name | TEXT | Legal registered company name |
lei | VARCHAR | Legal Entity Identifier (ISO 17442 20-character code) |
industry | TEXT | Industry classification description |
sector | VARCHAR | Broad market sector (e.g. Technology, Healthcare, Financials) |
sic_code | VARCHAR | Standard Industrial Classification code (4-digit) |
fiscal_year_end | VARCHAR | Fiscal year end as MMDD (e.g. 1231 = December 31) |
location | TEXT | Company headquarters location (city, state, country) |
website | TEXT | Company primary website URL |
ceo | TEXT | Chief Executive Officer name at most recent filing |
founded_year | SMALLINT | Year the company was founded |
description | TEXT | Company business description extracted from SEC filings |
status | VARCHAR | Entity status: ACTIVE, INACTIVE, or DELISTED. Filter status != 'ACTIVE' to include all non-active companies |
entity_type | TEXT | SEC-defined filer category (e.g. Non-accelerated Filer, Large Accelerated Filer) |
ein | TEXT | Employer Identification Number (IRS tax ID) |
owner_org | TEXT | Parent or controlling organization name |
investor_website | TEXT | Investor relations website URL |
category | TEXT | SEC registrant category |
state_of_incorporation | TEXT | US state or country of incorporation |
phone | TEXT | Company contact phone number |
address | TEXT | Principal registered business address |
last_seen_at | DATE | Date this entity last appeared in any SEC filing |
ingested_at | TIMESTAMPTZ | Timestamp when Valuein first ingested this entity record |
security
17 fields| Field | Type | Description |
|---|---|---|
id | INTEGER | Surrogate primary key for the security record |
entity_id | VARCHAR | Foreign key to entity.cik — the issuing company |
symbol | VARCHAR | Exchange ticker symbol |
exchange | TEXT | Stock exchange name (e.g. NASDAQ, NYSE) |
mic | VARCHAR | Market Identification Code (ISO 10383) |
valid_from | DATE | SCD Type 2 start date — when this ticker became active on this exchange |
valid_to | DATE | SCD Type 2 end date — when this ticker was retired. NULL means currently active |
is_active | BOOLEAN | Derived: TRUE when valid_to IS NULL. Use to filter current tickers only |
figi | VARCHAR | Financial Instrument Global Identifier at share-class level (OpenFIGI) |
composite_figi | VARCHAR | Composite FIGI at the exchange level |
share_class_figi | VARCHAR | Share class FIGI |
security_type | VARCHAR | Instrument type (e.g. Common Stock, ADR, ETF) |
market_sector | VARCHAR | FIGI market sector classification |
figi_ticker | VARCHAR | Ticker symbol as recorded in the OpenFIGI registry |
figi_name | TEXT | Security name from the OpenFIGI registry |
figi_exch_code | VARCHAR | Exchange code from the OpenFIGI registry |
ingested_at | TIMESTAMPTZ | Timestamp when Valuein first ingested this security record |
filing
11 fields| Field | Type | Description |
|---|---|---|
accession_id | VARCHAR | SEC EDGAR accession number — unique filing identifier (format: XXXXXXXXXX-YY-ZZZZZZ) |
entity_id | VARCHAR | Foreign key to entity.cik — the filing company |
filing_date | DATE | Date the SEC received and accepted the filing. Use this for PIT backtest filters — not report_date |
form_type | VARCHAR | Filing type: 10-K (annual), 10-Q (quarterly), 20-F (foreign annual), 8-K (current report), etc. |
filing_url | TEXT | Full URL to the filing document on SEC EDGAR |
is_amendment | BOOLEAN | TRUE if this is an amended filing (e.g. 10-K/A, 10-Q/A) |
amendment_no | SMALLINT | Amendment sequence number — 0 for the original filing, 1 for the first amendment, etc. |
report_date | DATE | Fiscal period end date covered by this filing. Using this as a PIT filter introduces look-ahead bias — use filing_date instead |
items | TEXT[] | Array of 8-K item numbers reported (e.g. ['1.01', '9.01']). NULL for non-8-K filings |
accepted_at | TIMESTAMPTZ | Exact timestamp the SEC accepted the filing (millisecond precision). Source for fact.knowledge_at |
ingested_at | TIMESTAMPTZ | Timestamp when Valuein ingested this filing record |
fact
30 fields| Field | Type | Description |
|---|---|---|
entity_id | VARCHAR | Foreign key to entity.cik |
accession_id | VARCHAR | Foreign key to filing.accession_id |
concept | TEXT | Raw XBRL tag as filed by the company (e.g. us-gaap:Revenues). Use standard_concept for cross-company comparisons |
standard_concept | VARCHAR | Canonical normalized concept label (e.g. TotalRevenue). Maps 10K+ raw XBRL tags to ~150 standard concepts for cross-company queries |
priority | SMALLINT | Standardization priority — lower value = higher confidence in the concept mapping |
label | TEXT | Human-readable label from the company's XBRL filing context |
value | TEXT | Raw string value as filed in the XBRL document |
numeric_value | FLOAT | Parsed numeric value. NULL for non-numeric or string facts |
unit | VARCHAR | Unit of measure: USD, shares, pure (dimensionless ratio), etc. |
period_start | DATE | Reporting period start date. NULL for instant/point-in-time facts (e.g. balance sheet items) |
period_end | DATE | Reporting period end date. Table is range-partitioned on this column by year |
fiscal_year | SMALLINT | Fiscal year number (e.g. 2024) |
fiscal_period | VARCHAR | Fiscal period label: FY (annual), Q1, Q2, Q3, Q4 |
frame | VARCHAR | XBRL frame identifier linking this fact to a specific reporting period context |
statement_type | VARCHAR | Financial statement category: income, balance_sheet, cash_flow, dei, notes |
accuracy_score | REAL | Standardization confidence: 1.0 = manual/exact, 0.85 = taxonomy match, 0.55 = keyword fallback, 0.0 = unmapped. Use >= 0.70 for analytics |
confidence_score | FLOAT | Composite data quality confidence metric (0.0–1.0) |
is_negative | BOOLEAN | TRUE if the originally reported value was negative |
is_audited | BOOLEAN | TRUE if this fact comes from an audited annual filing (10-K or 20-F) |
is_estimated | BOOLEAN | TRUE if the value is estimated rather than directly reported |
data_quality | TEXT | Data quality flag: reported, estimated, or provisional |
knowledge_at | TIMESTAMPTZ | Point-in-time timestamp — when this fact became publicly known via SEC EDGAR. Filter knowledge_at <= trade_date for look-ahead-free backtests |
ingested_at | TIMESTAMPTZ | Timestamp when Valuein ingested this fact |
accounting_standard | VARCHAR | Accounting framework: US-GAAP or IFRS |
reporting_currency | VARCHAR | ISO 4217 currency code for the reported value (e.g. USD, EUR, GBP) |
period_span_days | INTEGER | Number of calendar days spanned by this reporting period |
is_cumulative | BOOLEAN | TRUE if the value is cumulative year-to-date rather than for the period in isolation (common in 10-Q cash flow statements) |
derived_quarterly_value | FLOAT | Quarterly value derived by subtracting prior YTD periods. Use COALESCE(derived_quarterly_value, numeric_value) for single-quarter cash flow metrics |
derivation_type | VARCHAR | Method used to derive the quarterly value: ytd_subtraction, direct, or null if not derived |
derivation_uncertain | BOOLEAN | TRUE if the quarterly derivation has reduced confidence due to fiscal period ambiguity |
valuation
26 fields| Field | Type | Description |
|---|---|---|
id | BIGINT | Surrogate primary key for the valuation record |
entity_id | VARCHAR | Foreign key to entity.cik — the company being valued |
security_id | INTEGER | Foreign key to security.id — the specific security valued (NULL if company-level only) |
valuation_date | DATE | Date the valuation was computed, based on the most recent available financials at that point in time |
model_type | VARCHAR | Valuation model used: dcf (Discounted Cash Flow) or ddm (Dividend Discount Model) |
model_version | VARCHAR | Version string of the valuation model algorithm (e.g. 1.0) |
intrinsic_value | NUMERIC | Total estimated intrinsic equity value in USD |
per_share_value | NUMERIC | Intrinsic value per diluted share in USD |
margin_of_safety | NUMERIC | Percentage discount of current market price to intrinsic value. Positive = undervalued, negative = overvalued |
valuation_label | VARCHAR | Summary label derived from margin of safety: undervalued, fairvalued, or overvalued |
base_earnings | NUMERIC | Base earnings or free cash flow used as the primary model input (USD) |
growth_rate | NUMERIC | Near-term earnings or FCF growth rate assumption applied in the high-growth stage |
terminal_rate | NUMERIC | Perpetuity terminal growth rate assumption for the Gordon Growth terminal value calculation |
discount_rate | NUMERIC | Discount rate applied: WACC for DCF models, required return for DDM models |
growth_years | INTEGER | Number of years modelled in the explicit high-growth stage before applying terminal value |
terminal_years | INTEGER | Number of years used to estimate the terminal value horizon |
shares_outstanding | BIGINT | Diluted shares outstanding used to convert total equity value to a per-share figure |
current_price | NUMERIC | Market price per share at the valuation date (USD) — used to compute margin of safety |
base_dividend | NUMERIC | Annual dividend per share used as the DDM base input (USD) |
dividend_growth_rate | NUMERIC | Expected long-term dividend growth rate used in the DDM Gordon Growth Model |
payout_ratio | NUMERIC | Dividend payout ratio (dividends ÷ earnings) used in DDM calibration |
data_quality | VARCHAR | Quality of the inputs used: reported (direct from filings), estimated (derived), or provisional (incomplete data) |
source_period | VARCHAR | Fiscal period used as the model input (e.g. FY2024, TTM-2024Q3) |
assumptions | JSONB | JSON overflow field for additional model parameters and sensitivity analysis inputs |
created_at | TIMESTAMPTZ | Timestamp when this valuation record was first computed |
updated_at | TIMESTAMPTZ | Timestamp when this valuation record was last recomputed (ON CONFLICT DO UPDATE) |
taxonomy_guide
13 fields| Field | Type | Description |
|---|---|---|
standard_concept | TEXT | Canonical concept name (primary key) — matches fact.standard_concept for joins |
human_name | TEXT | Human-readable display name for the concept (e.g. Total Revenue) |
definition | TEXT | Plain-English definition of what this financial concept measures |
unit_type | VARCHAR | Expected unit of measure: monetary, shares, ratio, pure, etc. |
balance_type | VARCHAR | Accounting balance type: debit or credit |
source_reference | TEXT | US GAAP taxonomy reference identifier for this concept |
level | SMALLINT | Hierarchy level in the taxonomy structure (lower = higher in the hierarchy) |
statement_type | VARCHAR | Financial statement this concept belongs to: income, balance_sheet, cash_flow, dei, notes |
display_order | SMALLINT | Preferred display order within its statement type for financial statement presentation |
is_active | BOOLEAN | TRUE if this concept is actively used in the current GAAP taxonomy |
created_at | TIMESTAMPTZ | Timestamp when this concept was added to the taxonomy guide |
updated_at | TIMESTAMPTZ | Timestamp when this concept definition was last updated |
gaap_ifrs_comparable | BOOLEAN | TRUE if this concept has a directly comparable IFRS equivalent |
index_membership
5 fields| Field | Type | Description |
|---|---|---|
id | INTEGER | Surrogate primary key |
security_id | INTEGER | Foreign key to security.id — the index constituent |
index_name | VARCHAR | Index identifier (e.g. SP500, RUSSELL2000, MSCI_US) |
start_date | DATE | Date the security was added to the index |
end_date | DATE | Date the security was removed from the index. NULL means it is currently a member |
references
18 fields| Field | Type | Description |
|---|---|---|
cik | VARCHAR | SEC Central Index Key — 10-digit unique company identifier (from entity) |
name | TEXT | Legal registered company name (from entity) |
sector | TEXT | Broad market sector (from entity) |
industry | TEXT | Industry classification description (from entity) |
sic_code | VARCHAR | Standard Industrial Classification code 4-digit (from entity) |
status | VARCHAR | Entity status: ACTIVE, INACTIVE, or DELISTED (from entity) |
entity_type | VARCHAR | SEC-defined filer category (from entity) |
security_id | INTEGER | Surrogate PK of the security record (from security) |
symbol | VARCHAR | Exchange ticker symbol (from security) |
exchange | VARCHAR | Stock exchange name e.g. NASDAQ, NYSE (from security) |
mic | VARCHAR | Market Identification Code ISO 10383 (from security) |
valid_from | DATE | SCD Type 2 start date — when this ticker became active (from security) |
valid_to | DATE | SCD Type 2 end date — NULL means currently active (from security) |
is_active | BOOLEAN | TRUE when valid_to IS NULL — use to filter current tickers only (from security) |
figi | VARCHAR | Financial Instrument Global Identifier share-class level (from security) |
composite_figi | VARCHAR | Composite FIGI at the exchange level (from security) |
share_class_figi | VARCHAR | Share class FIGI (from security) |
is_sp500 | BOOLEAN | TRUE if the security is currently an S&P500 constituent (from index_membership) |
DuckDB Usage
All Parquet files load natively with DuckDB's read_parquet(). DuckDB handles ZSTD decompression automatically and supports lazy projection — only the columns you SELECT are read from disk.
-- Load a table directly from disk
SELECT *
FROM read_parquet('fact.parquet')
WHERE standard_concept = 'Revenues'
AND knowledge_at <= '2023-01-01'
LIMIT 10;
-- Query directly from the API URL (presigned or streamed)
SELECT r.symbol, fa.numeric_value / 1e9 AS revenue_billions
FROM read_parquet('https://data.valuein.biz/v1/sp500/references') r
JOIN read_parquet('https://data.valuein.biz/v1/sp500/fact') fa
ON fa.entity_id = r.cik
WHERE r.is_sp500 = TRUE
AND fa.standard_concept = 'Revenues'
AND fa.knowledge_at <= '2024-01-01'
ORDER BY revenue_billions DESC
LIMIT 20;ZSTD compression
All files are ZSTD-compressed. Typical compression ratio is 6–10x vs uncompressed. DuckDB decompresses transparently.
Column pruning
SELECT only the columns you need — DuckDB reads only those column chunks from the Parquet file, not the full table.
Predicate pushdown
DuckDB pushes WHERE clauses into Parquet row group statistics, skipping groups that cannot contain matching rows.
Explore the full field catalog
Search and filter all fields with SEC XBRL tag mappings, types, and example values in the interactive Data Catalog.