6 min read · TejHQ

Polars vs DuckDB on 2.6M Indian-market rows: I benchmarked both

Same data, same query, two engines. Polars wins on pure scan speed; DuckDB wins on remote Parquet pruning. Numbers, code, and the cases where each loses.

#polars #duckdb #benchmark #parquet

TejHQ ships 2.69 million rows of Indian equity OHLCV across 597 trading days, as of early February 2026. Two questions come up constantly in our Discussions tab: which DataFrame engine is faster, and is it better to pull the whole dataset locally or to query it over HTTP?

I ran the same six queries through Polars and DuckDB, locally and remotely, on a MacBook Pro M2 with a fast home connection (≈ 600 Mbps down). Numbers below are the median of five runs with a warm cache.
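
For readers who want to reproduce the "median of five runs, warm cache" protocol on their own queries, here is a minimal timing sketch. It is not the benchmark script from the repo; the function name `median_runtime` and its parameters are assumptions made for illustration.

```python
import statistics
import time
from typing import Callable


def median_runtime(query: Callable[[], object], runs: int = 5, warmup: int = 1) -> float:
    """Return the median wall-clock seconds over `runs` timed calls.

    `warmup` untimed calls run first so that the OS page cache and the
    engine's own caches are populated before anything is measured.
    """
    for _ in range(warmup):
        query()  # untimed warm-up call

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)
```

Wrap each query in a zero-argument callable, for example `median_runtime(lambda: lazy_frame.collect())`, and compare the returned medians across engines.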

The setup

The dataset is the standard tejhq/indian-markets Hive layout:

layout.txt
tejhq/indian-markets/
├── nse/year=2024/month=01/date=2024-01-02.parquet
├── nse/year=2024/month=01/date=2024-01-03.parquet
├── ...
└── bse/year=2026/month=02/date=2026-02-07.parquet

Each daily Parquet is ≈ 180 KB compressed (zstd level 3). Total dataset on disk: 1.42 GB. Total rows: 2,693,428.
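
Because the layout is regular, the path of any single day's file can be derived from the exchange and the date. A small sketch, assuming a local copy rooted at `data/` (the root used in the queries below); the helper name `daily_file` is hypothetical.

```python
import datetime as dt


def daily_file(exchange: str, day: dt.date, root: str = "data") -> str:
    """Build the path of one daily Parquet file in the Hive-style layout.

    The `root` default is an assumption about where the dataset was downloaded.
    """
    return (
        f"{root}/{exchange}/"
        f"year={day:%Y}/month={day:%m}/"
        f"date={day.isoformat()}.parquet"
    )


print(daily_file("nse", dt.date(2024, 1, 2)))
# data/nse/year=2024/month=01/date=2024-01-02.parquet
```

Passing such a path straight to a Parquet reader skips globbing entirely when only one day is needed.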

Query 1 - full scan, count by series

q1_polars.py
import polars as pl

# Lazy scan: nothing is read from disk until .collect()
df = pl.scan_parquet("data/**/*.parquet", hive_partitioning=True)
df.group_by("series").len().collect()

q1_duckdb.sql
SELECT series, COUNT(*)
FROM read_parquet('data/**/*.parquet', hive_partitioning=1)
GROUP BY series;

Polars: 0.41 s. DuckDB: 0.62 s. Polars is faster on raw columnar scan when the data is already on local disk. Both blow away anything pandas can do here.

Query 2 - single symbol, two-year history

This is the realistic backtester query: give me RELIANCE for the last 24 months.

q2_polars.py
import datetime as dt
import polars as pl

(pl.scan_parquet("data/nse/**/*.parquet", hive_partitioning=True)
   .filter(pl.col("symbol") == "RELIANCE")
   .filter(pl.col("date") >= dt.date(2024, 2, 1))
   .collect())

Polars 0.18 s. DuckDB 0.21 s. Effectively a tie - both engines push the date predicate into Hive partition pruning and never open the irrelevant files.

Query 3 - remote, no local files

This is the moment of truth. Can you skip downloading the dataset entirely and just query it from HuggingFace over HTTPS?

q3_duckdb_remote.sql
INSTALL httpfs; LOAD httpfs;
SELECT date, close
FROM read_parquet(
  'hf://datasets/tejhq/indian-markets/nse/**/*.parquet',
  hive_partitioning = 1
)
WHERE symbol = 'RELIANCE'
  AND date >= '2024-02-01';

DuckDB remote: 1.9 s. That's astonishing. DuckDB issues HTTP range requests to read the Parquet footer of each relevant file, then fetches only the columns and row groups the query needs. It answers in under two seconds without ever downloading the full ~700 MB of NSE data.

Polars over remote HTTP, by contrast, pulled the full files in our runs. The same query took 23 s. If you're not downloading the whole dataset, prefer DuckDB.
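
To make the footer-first access pattern concrete: an unencrypted Parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`. A reader that knows the file size can fetch those last 8 bytes, then request only the footer metadata. The sketch below computes that second range. It illustrates the file format only and is not DuckDB's implementation; the function name `footer_range` is hypothetical.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_range(file_size: int, tail: bytes) -> str:
    """Return the HTTP Range header value that covers the footer metadata.

    `tail` must be the final 8 bytes of the file:
    4-byte little-endian footer length, then the magic bytes.
    """
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not look like the end of a Parquet file")

    (footer_len,) = struct.unpack("<I", tail[:4])
    start = file_size - 8 - footer_len
    if start < len(PARQUET_MAGIC):
        raise ValueError("footer length is larger than the file allows")

    end = file_size - 9  # inclusive: last metadata byte, just before the 8-byte tail
    return f"bytes={start}-{end}"
```

For a hypothetical 180,000-byte file whose tail declares a 2,000-byte footer, this yields `bytes=177992-179991`: about 2 KB transferred to learn the schema, row-group offsets, and column statistics, instead of the whole file.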

Query 4 - cross-sectional, all symbols on one day

Polars 0.04 s. DuckDB 0.07 s. Both engines see one Hive partition (one file), and you're done.

Query 5 - join with a static reference table

Joining ISIN against a sectors table (~1.5K rows): Polars 0.09 s, DuckDB 0.11 s. Tie.

Query 6 - window function, 20-day moving average per symbol

q6_duckdb.sql
SELECT symbol, date, close,
       AVG(close) OVER (
         PARTITION BY symbol
         ORDER BY date
         ROWS BETWEEN 19 PRECEDING AND CURRENT ROW
       ) AS sma20
FROM read_parquet('data/nse/**/*.parquet', hive_partitioning=1)
WHERE date >= '2025-01-01';

DuckDB 1.3 s. Polars equivalent (over("symbol")) 1.6 s. DuckDB pulls slightly ahead on complex window functions over a year of data.

Verdict

tl;dr

  • Local + ad hoc analysis: Polars. Faster on plain scans, nicer chained API.
  • Local + heavy SQL / windows: DuckDB. Mature query optimizer wins.
  • Remote dataset, you don't want a local copy: DuckDB with httpfs. No contest.
  • Production batch ETL: Polars. Memory predictability matters more than peak speed.

The actual answer for almost everyone reading this: use both. They share the same Parquet files. There is no migration cost. We use DuckDB for exploratory SQL in notebooks and Polars for everything that ships.

Run the benchmarks yourself - the script is in the repo. If your numbers diverge meaningfully from ours, please open an issue with your hardware. We update this post when major versions of either engine ship.