Polars vs DuckDB on 2.6M Indian-market rows: I benchmarked both
Same data, same query, two engines. Polars wins on pure scan speed; DuckDB wins on remote Parquet pruning. Numbers, code, and the cases where each loses.
TejHQ ships 2.69 million rows of Indian equity OHLCV across 597 trading days, as of early February 2026. Two questions come up constantly in our Discussions tab: which DataFrame engine is faster, and is it worth pulling the whole dataset locally or querying it over HTTP?
I ran the same six queries through Polars and DuckDB, locally and remotely, on a MacBook Pro M2 with a fast home connection (≈ 600 Mbps down). Numbers below are the median of 5 runs with a warm cache.
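If you want to reproduce the "median of 5 runs, warm cache" methodology, a minimal timing harness could look like the sketch below. The function name `median_runtime` and the single warm-up call are my assumptions for illustration; the actual benchmark script in the repo may differ.

```python
import statistics
import time
from typing import Callable


def median_runtime(fn: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock time of `fn` in seconds.

    `warmup` untimed calls run first so the OS page cache and any
    engine-level caches are warm before measuring (an assumption about
    what "warmed cache" means in practice).
    """
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Pass each query as a zero-argument callable, for example `median_runtime(lambda: lazy_frame.collect())`, where `lazy_frame` is whatever lazy query you are measuring.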
The setup
The dataset is the standard tejhq/indian-markets Hive layout:
tejhq/indian-markets/
├── nse/year=2024/month=01/date=2024-01-02.parquet
├── nse/year=2024/month=01/date=2024-01-03.parquet
├── ...
└── bse/year=2026/month=02/date=2026-02-07.parquet

Each daily Parquet file is ≈ 180 KB compressed (zstd level 3). Total dataset on disk: 1.42 GB. Total rows: 2,693,428.
Query 1 - full scan, count by series
import polars as pl
df = pl.scan_parquet("data/**/*.parquet", hive_partitioning=True)
df.group_by("series").len().collect()

SELECT series, COUNT(*)
FROM read_parquet('data/**/*.parquet', hive_partitioning=1)
GROUP BY series;

Polars: 0.41 s. DuckDB: 0.62 s. Polars is faster on raw columnar scan when the data is already on local disk. Both blow away anything pandas can do here.
Query 2 - single symbol, two-year history
This is the realistic backtester query: give me RELIANCE for the last 24 months.
import datetime as dt

(pl.scan_parquet("data/nse/**/*.parquet", hive_partitioning=True)
    .filter(pl.col("symbol") == "RELIANCE")
    .filter(pl.col("date") >= dt.date(2024, 2, 1))
    .collect())

Polars 0.18 s. DuckDB 0.21 s. Effectively a tie - both engines push the date predicate into Hive partition pruning and never open the irrelevant files.
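To make the pruning idea concrete, here is a rough sketch of how a reader can discard files from their paths alone, before opening any Parquet. This is not either engine's implementation; `partition_keys` and `prune_by_month` are hypothetical helpers, and I only prune on the `year=` and `month=` directory keys, leaving the exact day filter to be applied to the rows of the surviving files.

```python
import datetime as dt


def partition_keys(path: str) -> dict[str, str]:
    """Extract key=value pairs from a Hive-style path."""
    keys = {}
    for segment in path.split("/"):
        if "=" in segment:
            name, value = segment.split("=", 1)
            keys[name] = value.removesuffix(".parquet")
    return keys


def prune_by_month(paths: list[str], since: dt.date) -> list[str]:
    """Keep only files whose (year, month) partition can contain dates >= since."""
    kept = []
    for path in paths:
        keys = partition_keys(path)
        if (int(keys["year"]), int(keys["month"])) >= (since.year, since.month):
            kept.append(path)
    return kept
```

With a cutoff of 1 February 2024, every file under `year=2024/month=01` is dropped without a single byte of it being read.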
Query 3 - remote, no local files
This is the moment of truth. Can you skip downloading the dataset entirely and just query it from HuggingFace over HTTPS?
INSTALL httpfs; LOAD httpfs;
SELECT date, close
FROM read_parquet(
    'hf://datasets/tejhq/indian-markets/nse/**/*.parquet',
    hive_partitioning = 1
)
WHERE symbol = 'RELIANCE'
  AND date >= '2024-02-01';

DuckDB remote: 1.9 s. That's astonishing. DuckDB makes range requests against the Parquet footer of each relevant file, fetches only the columns and row groups needed, and answers in under two seconds - never downloading the full ~700 MB of NSE data.
Polars over remote HTTP, by contrast, pulled the full files in this test. The same query took 23 s. If you're not downloading the whole dataset, prefer DuckDB.
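Why are footer range requests so cheap? A Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the last 8 bytes, learn where the metadata lives, and request only that slice. The sketch below shows the arithmetic; `footer_byte_range` and `range_header` are hypothetical helpers, and I am not claiming DuckDB issues exactly these requests.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_byte_range(file_size: int, tail: bytes) -> tuple[int, int]:
    """Given the file size and its last 8 bytes, return the inclusive
    byte range that holds the footer metadata."""
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")
    (footer_len,) = struct.unpack("<I", tail[:4])  # 4-byte little-endian length
    start = file_size - 8 - footer_len
    end = file_size - 8 - 1  # last metadata byte, just before length + magic
    return start, end


def range_header(start: int, end: int) -> dict[str, str]:
    """Build the HTTP Range header for an inclusive byte range."""
    return {"Range": f"bytes={start}-{end}"}
```

The footer lists every row group and column chunk with its offset, which is what lets a reader follow up with further range requests for just the `date` and `close` column chunks it needs.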
Query 4 - cross-sectional, all symbols on one day
Polars 0.04 s. DuckDB 0.07 s. Both engines see one Hive partition (one file), and you're done.
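The post doesn't show code for this query, so here is one way it might look, assuming the directory layout from the setup section. `daily_file` and `cross_section` are hypothetical names; the Polars import is deferred so the path helper works on its own.

```python
import datetime as dt
from pathlib import PurePosixPath


def daily_file(root: str, exchange: str, day: dt.date) -> str:
    """Build the path of the one daily file for an exchange and date."""
    return str(
        PurePosixPath(root)
        / exchange
        / f"year={day.year}"
        / f"month={day.month:02d}"
        / f"date={day.isoformat()}.parquet"
    )


def cross_section(root: str, exchange: str, day: dt.date):
    """Load every symbol for one trading day (requires Polars)."""
    import polars as pl  # deferred import: only needed when actually reading

    return pl.read_parquet(daily_file(root, exchange, day))
```

For example, `cross_section("data", "nse", dt.date(2024, 1, 2))` reads exactly one file, which is why this query is so fast in both engines.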
Query 5 - join with a static reference table
Joining on ISIN against a static sectors table (~1.5K rows): Polars 0.09 s, DuckDB 0.11 s. Tie.
Query 6 - window function, 20-day moving average per symbol
SELECT symbol, date, close,
    AVG(close) OVER (
        PARTITION BY symbol
        ORDER BY date
        ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
    ) AS sma20
FROM read_parquet('data/nse/**/*.parquet', hive_partitioning=1)
WHERE date >= '2025-01-01';

DuckDB 1.3 s. Polars equivalent (over("symbol")) 1.6 s. DuckDB pulls slightly ahead on complex window functions over a year of data.
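If the window frame is unfamiliar, the plain-Python reference below spells out what `ROWS BETWEEN 19 PRECEDING AND CURRENT ROW` computes: for each symbol, in date order, the mean of the current close and up to 19 closes before it. It is a sketch of the semantics only, not how either engine executes the query, and `trailing_mean` is a hypothetical name.

```python
from collections import defaultdict, deque


def trailing_mean(rows, window=20):
    """rows: iterable of (symbol, date, close) tuples.

    Returns (symbol, date, close, sma) tuples ordered by symbol, then date.
    Early rows with fewer than `window` predecessors are averaged over
    whatever is available, which matches the SQL frame's behaviour.
    """
    recent = defaultdict(lambda: deque(maxlen=window))  # per-symbol ring buffer
    out = []
    for symbol, date, close in sorted(rows, key=lambda r: (r[0], r[1])):
        buf = recent[symbol]
        buf.append(close)  # deque drops the oldest close once full
        out.append((symbol, date, close, sum(buf) / len(buf)))
    return out
```

One thing to check when porting this to another engine is how it treats the first 19 rows of each symbol: some rolling-mean APIs return nulls until the window is full, whereas the SQL frame above returns a partial average.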
Verdict
tl;dr
- Local + ad hoc analysis: Polars. Faster on plain scans, nicer chained API.
- Local + heavy SQL / windows: DuckDB. Mature query optimizer wins.
- Remote dataset, you don't want a local copy: DuckDB with httpfs. No contest.
- Production batch ETL: Polars. Memory predictability matters more than peak speed.
The actual answer for almost everyone reading this: use both. They share the same Parquet files. There is no migration cost. We use DuckDB for exploratory SQL in notebooks and Polars for everything that ships.
Run the benchmarks yourself - the script is in the repo. If your numbers diverge meaningfully from ours, please open an issue with your hardware. We update this post when major versions of either engine ship.