·7 min read·TejHQ

How TejHQ ingests every NSE & BSE equity for ₹0 a day

End-of-day market data for India costs anywhere from ₹0 to ₹24 lakh a year. We pay zero. Here's the SEBI CMTS Bhavcopy pipeline that powers tej-bazaar.

#pipeline #bhavcopy #nse #bse

Every weekday at 19:00 IST a small Python process wakes up on a free GitHub Actions runner, downloads two CSV files from NSE and BSE, normalizes them into a fourteen-column schema, writes Hive-partitioned Parquet, and pushes the result to HuggingFace. Cost to run: ₹0. Cost to consume: ₹0. Coverage: every equity that traded on either exchange that day - roughly 4,500 symbols.

This post is the full recipe. By the end you should be able to recreate it, fork it, or replace your paid EOD feed with it.

What is a Bhavcopy?

A Bhavcopy (literally "price sheet") is the official end-of-day report that NSE and BSE publish for every instrument that traded in that session. It contains OHLCV, ISIN, series code, turnover, and trade count. Both exchanges have published Bhavcopy in some form since the 1990s. It is free, public, and SEBI-mandated. No API key, no auth, no commercial restriction on the raw file.

Most retail quants in India still pay ₹40,000+ a year for the same data, and not because Bhavcopy is hard to find. Parsing a decade of inconsistent CSV formats, handling holiday calendars, deduping series codes, and shipping the result as a queryable dataset is annoying enough that nobody bothers. So vendors slap a price tag on the convenience.

The SEBI CMTS format

In late 2023, SEBI mandated a unified post-trade reporting format called CMTS - Common Market Trade Statistics. From 2024-01-01, both NSE and BSE publish their EOD Bhavcopy in a CMTS-aligned CSV with predictable columns and a stable filename. This is the cutover that made tej-bazaar possible.

The CMTS Bhavcopy URL pattern is dead simple:

urls.sh
# NSE - equity Bhavcopy
https://nsearchives.nseindia.com/products/content/sec_bhavdata_full_DDMMYYYY.csv

# NSE CMTS file (preferred)
https://nsearchives.nseindia.com/content/cm/BhavCopy_NSE_CM_0_0_0_YYYYMMDD_F_0000.csv.zip

# BSE CMTS file
https://www.bseindia.com/download/BhavCopy/Equity/BhavCopy_BSE_CM_0_0_0_YYYYMMDD_F_0000.CSV
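A minimal sketch of filling in the date placeholder for the two CMTS patterns above. The helper name `bhavcopy_url` and the dictionary keys are illustrative choices; only the URL templates come from the listing.

```python
import datetime as dt

# URL templates copied from the patterns above; {ymd} stands for YYYYMMDD.
CMTS_URLS = {
    "NSE": "https://nsearchives.nseindia.com/content/cm/"
           "BhavCopy_NSE_CM_0_0_0_{ymd}_F_0000.csv.zip",
    "BSE": "https://www.bseindia.com/download/BhavCopy/Equity/"
           "BhavCopy_BSE_CM_0_0_0_{ymd}_F_0000.CSV",
}


def bhavcopy_url(date: dt.date, exchange: str) -> str:
    """Hypothetical helper: build the CMTS Bhavcopy URL for one exchange and date."""
    return CMTS_URLS[exchange].format(ymd=date.strftime("%Y%m%d"))


print(bhavcopy_url(dt.date(2024, 1, 2), "NSE"))
```

Note that the NSE file is a zip archive while the BSE file is a bare CSV, so whatever consumes these URLs has to branch on the exchange.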

Three gotchas the casual scraper hits within a week:

  • NSE rejects requests without a real browser User-Agent and Referer.
  • NSE serves a fresh cookie on every nseindia.com hit. You have to capture and replay it.
  • Both exchanges return 200 OK with an empty body on weekends and holidays. Don't trust the HTTP status alone; check Content-Length, or the size of the body you actually received.
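All three gotchas can be handled with the standard library. The following is a sketch under assumptions, not the tej-bazaar fetcher: the header values, the `warm_up_url` default, and the helper names are illustrative, and what the exchanges actually check may change.

```python
import urllib.request
from http.cookiejar import CookieJar
from typing import Optional

# Assumed header values: any current desktop browser string should work.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Referer": "https://www.nseindia.com/",
}


def make_opener() -> urllib.request.OpenerDirector:
    """Opener that stores cookies from each response and replays them."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = list(BROWSER_HEADERS.items())
    return opener


def is_empty(body: bytes) -> bool:
    """Weekend and holiday responses arrive as 200 OK with nothing in them."""
    return len(body.strip()) == 0


def fetch(url: str, warm_up_url: str = "https://www.nseindia.com/") -> Optional[bytes]:
    """Return the file bytes, or None when the exchange served an empty body."""
    opener = make_opener()
    opener.open(warm_up_url, timeout=30).close()  # first hit sets the cookie
    with opener.open(url, timeout=30) as resp:    # second hit replays it
        body = resp.read()
    return None if is_empty(body) else body
```

Returning `None` for an empty body lets the caller treat a holiday as "nothing to do" rather than as a failure.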

The pipeline, in five stages

tej-bazaar is intentionally boring. There are five named stages and the same shape of data flows through each:

pipeline.py
# 1. fetch     - HTTP, browser headers, idempotent
# 2. parse     - CSV → Polars DataFrame, normalized 14-col schema
# 3. transform - filter equity series, dedupe, validate prices
# 4. write     - partitioned Parquet, zstd, Hive layout
# 5. publish   - HuggingFace upload_folder, content-hash dedup

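import datetime as dt
from pathlib import Path

# Exchange, fetch, parse, transform and write are assumed to be defined elsewhere in the package.
def run(date: dt.date, exchange: Exchange) -> Path:
    raw   = fetch(date, exchange)
    df    = parse(raw, exchange)
    clean = transform(df)
    return write(clean, date, exchange)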
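To make stage 3 concrete, here is a dependency-free sketch of "filter equity series, dedupe, validate prices" on plain dicts. It is a stand-in for illustration, not the Polars code the pipeline runs: the column names, the `EQUITY_SERIES` whitelist, and the first-valid-row-wins dedupe rule are all assumptions.

```python
EQUITY_SERIES = {"EQ", "BE"}  # assumed whitelist, not the project's actual list


def transform(rows: list) -> list:
    """Keep equity series, drop duplicate (symbol, series) rows, validate prices."""
    seen = set()
    clean = []
    for row in rows:
        if row["series"] not in EQUITY_SERIES:
            continue
        key = (row["symbol"], row["series"])
        if key in seen:
            continue
        # A sane bar has positive prices with open and close inside [low, high].
        lo_ok = 0 < row["low"] <= min(row["open"], row["close"])
        hi_ok = max(row["open"], row["close"]) <= row["high"]
        if not (lo_ok and hi_ok):
            continue
        seen.add(key)
        clean.append(row)
    return clean


sample = [
    {"symbol": "INFY", "series": "EQ", "open": 100.0, "high": 110.0, "low": 95.0, "close": 105.0},
    {"symbol": "INFY", "series": "EQ", "open": 100.0, "high": 110.0, "low": 95.0, "close": 105.0},
    {"symbol": "GSEC", "series": "GS", "open": 100.0, "high": 101.0, "low": 99.0, "close": 100.0},
    {"symbol": "BAD", "series": "EQ", "open": 100.0, "high": 90.0, "low": 95.0, "close": 100.0},
]
print([r["symbol"] for r in transform(sample)])  # ['INFY']
```

Because the function takes rows in and returns rows out, it can be tested without touching the network or the filesystem.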

Each stage is independently testable. Each can be re-run without re-running the previous one - the raw CSV cache is cheap, the Parquet output is content-addressed.
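One way to get content-addressed output is to put a hash of the bytes in the filename and skip the write when that file already exists. A minimal sketch, assuming SHA-256 and a `<stem>-<digest>.parquet` naming scheme; both are illustrative choices rather than the project's actual layout.

```python
import hashlib
import tempfile
from pathlib import Path


def write_if_changed(payload: bytes, out_dir: Path, stem: str):
    """Write payload to <stem>-<sha256 prefix>.parquet; skip if it already exists."""
    digest = hashlib.sha256(payload).hexdigest()[:16]
    target = out_dir / f"{stem}-{digest}.parquet"
    if target.exists():
        return target, False           # same bytes already on disk: nothing to do
    out_dir.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_bytes(payload)
    tmp.replace(target)                # rename into place once fully written
    return target, True


with tempfile.TemporaryDirectory() as d:
    first = write_if_changed(b"demo bytes", Path(d), "nse")
    second = write_if_changed(b"demo bytes", Path(d), "nse")
    print(first[1], second[1])  # True False
```

Re-running a stage with identical input then costs one hash and one `exists()` check.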

Why Polars and not Pandas?

Polars parses the BSE CMTS Bhavcopy in 11 ms on an M2 MacBook. Pandas takes 78 ms. That doesn't matter for one date. It matters when you backfill 600 trading days twice while debugging. Polars also has a stricter type system, which catches the inevitable BSE-CSV-with-stray-comma before it lands in your dataset.

Holidays and the calendar problem

The single biggest source of bugs in Indian-market pipelines is treating Saturday and the third Wednesday of October like normal trading days. We delegate to the exchange_calendars library, which maintains NSE and BSE calendars (they share trading days but not always settlement days).

calendar.py
import exchange_calendars as xcal

cal = xcal.get_calendar("XBOM")  # XBOM is BSE; confirm any NSE code with xcal.get_calendar_names() first
trading_days = cal.sessions_in_range("2024-01-01", "2026-01-15")
# Returns 487 actual trading sessions - not the 534 weekdays in that range.
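For comparison, the naive weekday-only baseline needs nothing beyond the standard library. A minimal sketch (the helper name is illustrative):

```python
import datetime as dt


def weekdays_between(start: dt.date, end: dt.date) -> int:
    """Count Monday-to-Friday dates in the inclusive range [start, end]."""
    days = (end - start).days + 1
    return sum(
        1 for i in range(days)
        if (start + dt.timedelta(days=i)).weekday() < 5
    )


print(weekdays_between(dt.date(2024, 1, 1), dt.date(2026, 1, 15)))  # 534
```

The gap between this number and the session count from the calendar is the set of weekday holidays, and each one is a date where a weekday-only scheduler would try to fetch a file that does not exist.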

Publishing to HuggingFace

We publish to tejhq/indian-markets as Hive-partitioned Parquet with zstd compression. The whole dataset is ~1.4 GB as of January 2026 and grows by ~3 MB per trading day.
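"Hive-partitioned" means the directory names carry `key=value` pairs that query engines can read back as columns. A minimal sketch of such a path builder; the partition keys (`exchange`, `year`, `month`), the `data` root, and the file naming are assumptions for illustration, not the published layout of the dataset.

```python
import datetime as dt
from pathlib import PurePosixPath


def hive_path(exchange: str, date: dt.date, root: str = "data") -> PurePosixPath:
    """Hypothetical layout: <root>/exchange=X/year=YYYY/month=MM/<date>.parquet"""
    return (
        PurePosixPath(root)
        / f"exchange={exchange}"
        / f"year={date.year}"
        / f"month={date.month:02d}"
        / f"{date.isoformat()}.parquet"
    )


print(hive_path("NSE", dt.date(2024, 1, 2)))
# data/exchange=NSE/year=2024/month=01/2024-01-02.parquet
```

With a layout like this, a reader that filters on exchange or month only has to open the matching directories.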

HuggingFace Datasets is free for public datasets, has CDN-backed downloads, supports range requests (so DuckDB can do partial Parquet reads over HTTP), and gives us a stable URL with versioning. For an open data project in 2026, we have not found a better hosting option.

Reproducibility note

Every Parquet file we publish includes git_sha and pipeline_version in its metadata footer. If we change a parsing rule, you can pin a tej-bazaar version and re-derive the exact dataset.

What's next

Phase 4 of the roadmap is corporate actions and adjusted close - splits, bonuses, dividends, symbol changes. That's the last gap between TejHQ and a paid EOD feed for serious backtesters. Expect a follow-up post in Q2 2026.

If you want the data right now, install tej-bazaar from PyPI or just read the dataset on HuggingFace. If you find a parsing bug, please open an issue - that's the only way the dataset gets better.