TEJHQ
All posts
·7 min read·TejHQ

Sixteen years of NSE bhavcopy, finally on tap

What it took to backfill NSE end-of-day prices from 2010 to today: two URL schemes, two CSV schemas, a SEBI format cutover in the middle, and a Go SQL driver that hates NULL.

#bhavcopy#nse#backfill#engineering

TejHQ now serves NSE end-of-day prices and corporate actions from 4 January 2010 all the way to last night's close. That is 4,047 trading days, 7 million OHLCV rows, 39,000+ corp actions, all queryable through api.tejhq.dev and downloadable as Parquet from R2 and HuggingFace. Free, no auth.

When we went live two months ago we only had data from July 2024 onward, the date NSE adopted the SEBI CMTS bhavcopy format. Everything before that lived behind a different URL with different column names, and we punted. This is the build log of what it took to fill the gap, and the three small fires we lit on the way.

Two URLs, one bhavcopy

NSE publishes the daily bhavcopy at two different endpoints, and the cutover is hard: there is no overlap, no redirect, no graceful fallback.

urls.txt
# Modern (2024-07-08 onward) - SEBI CMTS format
https://nsearchives.nseindia.com/content/cm/BhavCopy_NSE_CM_0_0_0_YYYYMMDD_F_0000.csv.zip

# Legacy (1995 - 2024-07-05) - the old format
https://archives.nseindia.com/content/historical/EQUITIES/YYYY/MMM/cmDDmmmYYYYbhav.csv.zip

Pick the right URL for the right date. Our fetcher does this with a single date constant:

fetch.py
NSE_LEGACY_CUTOVER = date(2024, 7, 8)

def fetch_nse(d: date, raw_dir: Path) -> Path:
    if d < NSE_LEGACY_CUTOVER:
        url = NSE_BHAVCOPY_URL_LEGACY.format(
            yyyy=d.year, mmm=d.strftime("%b").upper(), dd=f"{d.day:02d}"
        )
    else:
        url = NSE_BHAVCOPY_URL.format(yyyymmdd=d.strftime("%Y%m%d"))
    ...

Three letters of date formatting matter here. The legacy URL wants cm04JAN2010bhav.csv.zip with the month uppercased to three letters. Get it wrong and you get a polite NSE 404 page with no hint that the year and month exist as a directory listing one URL up.

Two CSV schemas, one parser

The modern CSV has a column called TradDt. The legacy one has SYMBOL in the first column and TIMESTAMP for the trading date. We dispatch on the header row:

parse.py
def parse_bhavcopy(csv: bytes, exchange: str) -> pl.DataFrame:
    df = pl.read_csv(io.BytesIO(csv))
    cols = set(df.columns)
    if "TradDt" in cols:
        return _parse_modern(df, exchange)
    if {"SYMBOL", "TIMESTAMP"} <= cols:
        return _parse_legacy_nse(df, exchange)
    raise ValueError(f"schema not recognised: {sorted(cols)}")

The legacy parser is more lenient than its modern sibling. Pre-2012 NSE rows have no ISIN column and no TOTALTRADES column. We fill both with null and let the downstream layers decide what to do with absent fields:

parse.py
_LEGACY_NSE_COLUMN_MAP = {
    "SYMBOL": "symbol", "SERIES": "series",
    "OPEN": "open", "HIGH": "high", "LOW": "low",
    "CLOSE": "close", "LAST": "last", "PREVCLOSE": "prev_close",
    "TOTTRDQTY": "volume", "TOTTRDVAL": "turnover",
    "TIMESTAMP": "date",
    "TOTALTRADES": "trades",   # appears 2012+
    "ISIN": "isin",            # appears 2012+
}
_LEGACY_NSE_OPTIONAL = {"TOTALTRADES", "ISIN"}

Honest gaps in the data beat invented data. If 1 January 2010 did not record a trade count, the row gets trades = null, not trades = 0. The downstream surfaces decide how to render null. Which leads to the third fire.

The Go SQL driver hates NULL

Once the parquet was written and pushed to R2, we hit a clean smoke test for /v1/ohlcv/nse/RELIANCE?from=2010-01-04&to=2010-01-15 and got a 500. The log line from Cloud Run was crisp:

error.log
sql: Scan error on column index 2, name "isin":
  converting NULL to string is unsupported

The Go database/sql driver cannot scan a SQL NULL into a plain string or int64. The textbook fix is to change every wire-model field to a pointer (*string, *int64) so they can hold nil. We did not do that. Wire-model churn is contagious, and the API has six handlers and four SDK clients that would all need to grow nullable types overnight.

Instead we pushed the decision down to SQL, where it costs nothing:

queries.go
SELECT date, open, high, low, close, last, prev_close,
       volume, turnover,
       COALESCE(trades, 0) AS trades   -- empty in pre-2012 NSE rows
FROM read_parquet('r2://tej-bazaar/nse/**/*.parquet', hive_partitioning=1)
WHERE symbol = ?
  AND year BETWEEN ? AND ?
  AND date BETWEEN ? AND ?
ORDER BY date

trades becomes 0 instead of null. isin and name become the empty string instead of null. The Go scan succeeds. Old clients see what they always saw. New clients reading a 2010 row will see an empty-string ISIN, which is a faithful signal that the original NSE bhavcopy never carried one. We document this on the docs page.

Corporate actions, same date range

The bhavcopy gives you OHLCV. To use those prices for backtesting or factor research you also need the corporate action stream so you can back-adjust. NSE publishes dividends, splits, bonuses and rights through a separate JSON feed, and the feed goes back to the early 2000s. We pulled it for every year 2010 through 2026.

backfill.sh
for y in $(seq 2010 2026); do
  tej-bazaar actions fetch --year $y -e nse
done

Result: 17 annual parquets, 100% ISIN match on every NSE year. A 2010 dividend on RELIANCE now shows up in /v1/actions/RELIANCE next to the 2026 one:

curl.sh
$ curl -s https://api.tejhq.dev/v1/actions/RELIANCE | jq '.data[-1]'
{
  "exchange":    "NSE",
  "symbol":      "RELIANCE",
  "isin":        "INE002A01018",
  "company":     "Reliance Industries Limited",
  "ex_date":     "2010-05-10T00:00:00Z",
  "type":        "dividend",
  "cash_amount": 7,
  "raw_subject": "Dividend Rs 7 Per Share"
}

Where BSE stands

BSE is a different story. The SEBI CMTS cutover hit BSE on the same day as NSE, but BSE's legacy format used numeric scrip codes (500325 for Reliance) rather than the alphanumeric tickers (RELIANCE) used in the modern format. There is no clean bridge: a backfilled BSE row with symbol 500325 and a modern row with symbol RELIANCE cannot be joined without a per-ISIN crosswalk that has its own gaps.

We scoped BSE to 2024-07-08 onward: every row uses the same modern alphanumeric ticker, every join is honest. BSE corp actions, where the source feed is cleaner, are backfilled for 2024, 2025 and 2026.

What you can do with it today

Sixteen years of NSE end-of-day, two years of BSE end-of-day, all corporate actions for both, in three places:

  • API: api.tejhq.dev, free tier, no auth. OpenAPI reference at /docs.
  • Parquet on R2: r2://tej-bazaar/<exchange>/year=YYYY/month=MM/*.parquet, hive-partitioned, read directly with DuckDB or Polars.
  • HuggingFace: tejhq/indian-markets, mirrored nightly by the same cron.

Next on the roadmap: derived metrics (returns, rolling windows) backfilled across the full 16-year NSE history, and a hosted query playground so you can pull a 10-year backtest series without leaving the docs page. If that sounds useful, the issue tracker at github.com/tejhq/tej-api is the place to nudge us.