DuckDB + Parquet recipes for Valuein
Seven query patterns that cover ~80% of what quants and analysts ask the warehouse. Every snippet runs against the real Parquet schema — copy, paste, edit the ticker, run.
Before you start
All snippets assume Parquet files are local — pulled from the Bulk Data API or via the Python SDK. You can also point read_parquet() directly at an HTTPS URL with DuckDB's httpfs extension.
-- One-time setup
INSTALL httpfs;
LOAD httpfs;
-- Read directly from a presigned R2 URL (compute_ready_stream)
SELECT COUNT(*)
FROM read_parquet('https://r2.example.com/full/fact.parquet?...');Start every cross-company query with `references`
The `references` table is a flat join of entity + security + index_membership — one row per security with `is_sp500`, `sector`, `is_active` already populated. Joining the three source tables yourself works too, but adds three scans and an order-of-magnitude planner cost.
SELECT
symbol,
name,
sector,
industry
FROM read_parquet('references.parquet')
WHERE is_sp500 = TRUE
AND is_active = TRUE
ORDER BY sector, symbol;Note · Use this as the spine of any screen, peer search, or universe construction.
Latest filing per company with LATERAL
The most common bug in cross-sectional fundamentals queries: pulling all filings then deduplicating in pandas. DuckDB's LATERAL join lets you correlate one row per company without a self-join.
WITH r AS (
SELECT symbol, sector
FROM read_parquet('references.parquet')
WHERE is_sp500 = TRUE AND is_active = TRUE
)
SELECT
r.symbol,
r.sector,
f.fiscal_year,
f.fiscal_period,
f.numeric_value AS revenue
FROM r,
LATERAL (
SELECT fiscal_year, fiscal_period, numeric_value
FROM read_parquet('fact.parquet') AS f0
WHERE f0.ticker = r.symbol
AND f0.standard_concept = 'TotalRevenue'
AND f0.fiscal_period = 'FY'
ORDER BY f0.fiscal_year DESC, f0.accepted_at DESC
LIMIT 1
) f
ORDER BY revenue DESC NULLS LAST
LIMIT 25;Note · LATERAL evaluates the inner query once per outer row but stops at LIMIT 1 — much cheaper than a full self-join.
Pivot multiple concepts in a single fact scan
Naively, pulling revenue + net income + EPS means three queries against the fact table. A single MAX(CASE) pivot scans `fact` once and returns all three as columns. For 105M rows, that's the difference between 200ms and 2s.
SELECT
ticker,
fiscal_year,
MAX(CASE WHEN standard_concept = 'TotalRevenue' THEN numeric_value END) AS revenue,
MAX(CASE WHEN standard_concept = 'NetIncome' THEN numeric_value END) AS net_income,
MAX(CASE WHEN standard_concept = 'EPS_Diluted' THEN numeric_value END) AS eps_diluted,
MAX(CASE WHEN standard_concept = 'OperatingCashFlow' THEN numeric_value END) AS ocf
FROM read_parquet('fact.parquet')
WHERE ticker IN ('AAPL', 'MSFT', 'GOOGL', 'NVDA')
AND fiscal_period = 'FY'
AND fiscal_year >= 2020
GROUP BY ticker, fiscal_year
ORDER BY ticker, fiscal_year;Note · Add as many concepts as you need — the cost is one extra CASE per concept, not one extra scan.
Use QUALIFY ROW_NUMBER() instead of correlated subqueries
When you need the latest accepted_at for each (ticker, period) — to dedupe original-vs-amended filings — QUALIFY is dramatically cleaner than a correlated MAX subquery.
SELECT
ticker,
fiscal_year,
fiscal_period,
standard_concept,
numeric_value,
accepted_at
FROM read_parquet('fact.parquet')
WHERE standard_concept = 'NetIncome'
AND ticker = 'TSLA'
QUALIFY ROW_NUMBER() OVER (
PARTITION BY ticker, fiscal_year, fiscal_period, standard_concept
ORDER BY accepted_at DESC
) = 1
ORDER BY fiscal_year DESC, fiscal_period;Note · Returns one row per fiscal period — the most recently accepted (i.e. post-amendment) value. Drop the QUALIFY clause to see all amendments.
Quarterly cash flow without YTD bleed
Q2 and Q3 10-Q cash flow figures are year-to-date, not quarter-only. The pipeline computes the incremental quarterly figure as `derived_quarterly_value`. Use COALESCE to prefer the derived value where present.
SELECT
ticker,
fiscal_year,
fiscal_period,
COALESCE(derived_quarterly_value, numeric_value) AS quarterly_ocf
FROM read_parquet('fact.parquet')
WHERE standard_concept = 'OperatingCashFlow'
AND ticker = 'AMZN'
AND fiscal_period IN ('Q1', 'Q2', 'Q3', 'Q4')
AND fiscal_year >= 2022
ORDER BY fiscal_year, fiscal_period;Note · For annual numbers (FY rows) `numeric_value` is correct — the COALESCE just falls back to it when no quarterly derivation exists.
Point-in-time screen: what was investable on a date
Backtesting a strategy requires the universe that existed on the trade date — not today's universe. Filter `index_membership` by date range AND `fact.accepted_at` to ensure both the company AND its data were knowable.
WITH universe_at AS (
SELECT m.cik, m.symbol
FROM read_parquet('index_membership.parquet') AS m
WHERE m.index_name = 'sp500'
AND m.start_date <= DATE '2018-01-01'
AND (m.end_date IS NULL OR m.end_date > DATE '2018-01-01')
),
known_facts AS (
SELECT
ticker,
fiscal_year,
MAX(CASE WHEN standard_concept = 'TotalRevenue' THEN numeric_value END) AS revenue,
MAX(CASE WHEN standard_concept = 'NetIncome' THEN numeric_value END) AS net_income
FROM read_parquet('fact.parquet')
WHERE accepted_at <= TIMESTAMP '2018-01-01 00:00:00'
AND fiscal_period = 'FY'
AND fiscal_year = 2017
GROUP BY ticker, fiscal_year
)
SELECT
u.symbol,
k.revenue,
k.net_income,
k.net_income / NULLIF(k.revenue, 0) AS net_margin
FROM universe_at u
JOIN known_facts k ON k.ticker = u.symbol
ORDER BY net_margin DESC NULLS LAST
LIMIT 50;Note · The universe filter prevents survivorship bias; the accepted_at filter prevents look-ahead bias. Both are required.
Defensive ratio math: ABS and NULLIF every divisor
Capex is reported as negative on cash flow statements; some companies have zero revenue in a given quarter. Wrap every divisor in NULLIF and every cash-flow denominator in ABS.
SELECT
ticker,
fiscal_year,
ocf,
capex,
-- ABS handles companies that report capex as negative;
-- NULLIF prevents division-by-zero on revenue gaps.
ocf - ABS(capex) AS free_cash_flow,
(ocf - ABS(capex)) / NULLIF(revenue, 0) AS fcf_margin
FROM (
SELECT
ticker,
fiscal_year,
MAX(CASE WHEN standard_concept = 'TotalRevenue' THEN numeric_value END) AS revenue,
MAX(CASE WHEN standard_concept = 'OperatingCashFlow' THEN numeric_value END) AS ocf,
MAX(CASE WHEN standard_concept = 'CAPEX' THEN numeric_value END) AS capex
FROM read_parquet('fact.parquet')
WHERE fiscal_period = 'FY'
AND ticker IN ('AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META')
AND fiscal_year = 2024
GROUP BY ticker, fiscal_year
)
ORDER BY fcf_margin DESC NULLS LAST;Note · Make NULLIF on every divisor a habit — DuckDB returns NULL on divide-by-zero, but errors on integer divide-by-zero in some compilation modes.
More patterns?
The Python SDK ships with pre-built query templates for the most common workflows — factor screens, DCF inputs, peer comparables, insider ownership.