Shipping back-adjusted Indian equity prices, reconciled vs Yahoo
Splits, bonuses, dividends, ISIN drift across mergers. How TejHQ Phase 4 actually shipped: the adjustment math, the per-ISIN reverse cumprod, the post-merger ISIN resolver, and the 89% Yahoo reconciliation.
Last quarter we published an RFC for how TejHQ would handle corporate actions. This is the follow-up: Phase 4 is shipped. The bazaar cron now publishes corporate actions, back-adjusted prices, and a per-ISIN symbol-history layer alongside raw bhavcopy. Every night, for both NSE and BSE.
This post is the build log. What we shipped, where the math gets weird, the one bug that ate two weeks (post-merger ISIN drift), and how the adjusted closes hold up when you put them next to Yahoo Finance across 25,000 daily comparisons.
What ships, every night
Bazaar now writes five parquet trees per cron run. Three of them are new in Phase 4:
- actions/<ex>_<YYYY>.parquet - one row per corporate event, classified.
- prices_adjusted/<ex>_<YYYY>.parquet - raw bhavcopy plus adj_close and adj_factor_cumulative.
- symbol_history/<ex>.parquet - per-ISIN intervals of which symbol traded when.
Bhavcopy (nse/, bse/) and metrics (metrics/<ex>_<YYYY>.parquet, Phase 5) round out the five. Everything is published under one Hugging Face dataset, tejhq/indian-markets, partitioned for cheap range queries in DuckDB.
The action types we classify
NSE and BSE both expose corporate action feeds. Both are messy. NSE returns JSON gated behind a browser-header dance. BSE returns clean JSON but ISINs require a separate scrip-master lookup. We normalize both into a single 11-column schema, with a type field constrained to nine values:
# Final classified type set
ACTION_TYPES = (
    "dividend",  # cash dividend on ex-date
    "split",     # face value change, e.g. 10 -> 1
    "bonus",     # N:M bonus issue
    "rights",    # rights at a discount price
    "buyback",   # tender / open-market buyback
    "demerger",  # parent splits into multiple entities
    "merger",    # entity absorbed into another (history continues under new ISIN)
    "agm",       # annual general meeting (kept for audit, no price impact)
    "other",     # un-classified subjects (kept verbatim in raw_subject)
)
The agm and other buckets exist because the source feeds publish more than just price-affecting events, and dropping them silently makes downstream reconciliation harder. We keep the verbatim subject line in raw_subject for audit.
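To make the classification step concrete, here is a minimal sketch of a keyword-based classifier that maps a raw subject line onto the type set above. The rule table, the `classify` function and the sample subject strings are all hypothetical illustrations, not the rules TejHQ actually runs; the only parts taken from the post are the type names and the fallback to "other".

```python
import re

# Hypothetical keyword rules, checked in order (first match wins).
# Price-affecting types come before "agm" so that a combined subject such as
# "Annual General Meeting/Dividend" is still classified as a dividend.
RULES = (
    ("bonus", re.compile(r"\bbonus\b", re.I)),
    ("split", re.compile(r"\b(split|sub-?division)\b", re.I)),
    ("rights", re.compile(r"\brights\b", re.I)),
    ("buyback", re.compile(r"\bbuy-?\s?back\b", re.I)),
    ("demerger", re.compile(r"\bde-?merger\b", re.I)),
    ("merger", re.compile(r"\b(merger|amalgamation)\b", re.I)),
    ("dividend", re.compile(r"\bdividend\b", re.I)),
    ("agm", re.compile(r"\b(agm|annual general meeting)\b", re.I)),
)


def classify(raw_subject: str) -> str:
    """Return one of the nine action types for a raw subject line."""
    for action_type, pattern in RULES:
        if pattern.search(raw_subject):
            return action_type
    # Nothing matched: keep the row and let raw_subject carry the detail.
    return "other"


# Illustrative subject lines, not taken from either exchange feed.
for subject in ("Dividend - Rs 5 Per Share", "Bonus 1:1", "Scheme of Arrangement"):
    print(subject, "->", classify(subject))
```

Ordering the rules explicitly keeps the tie-breaking visible, which matters when one announcement bundles several events.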
The adjustment math (what actually runs)
For each price-affecting event we compute a multiplicative factor, then back-adjust every prior close by the cumulative product of all future factors. The three formulas:
# Split / face-value change (e.g. fv 10 -> 1 means 1:10 split)
factor = face_value_to / face_value_from
# Bonus N:M (N new shares for every M held; e.g. 1:1 -> 1 new for 1 held)
factor = m / (n + m)
# Cash dividend D on prior session close C
factor = (c - d) / c
# Back-adjusted close: every close before the ex-date is multiplied by
# the cumulative product of all factors with ex_date > date, per ISIN.
adj_close = close * adj_factor_cumulative
We use NSE's official dividend formula, (prev_close - cash) / prev_close, rather than Yahoo's CRSP-style 1 - cash / close_on_ex_date. Same answer to within rounding for most names; the difference compounds visibly for dividend-heavy stocks. More on that under reconciliation.
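As a worked example of the formulas above, the sketch below computes each factor for one made-up instrument and multiplies them into a cumulative factor. The function names and every number are illustrative assumptions; only the three formulas and the ex-date-close variant come from the text.

```python
import math


def split_factor(face_value_from: float, face_value_to: float) -> float:
    """Face-value change, e.g. 10 -> 1 gives 0.1."""
    return face_value_to / face_value_from


def bonus_factor(n: int, m: int) -> float:
    """N new shares for every M held, e.g. 1:1 gives 0.5."""
    return m / (n + m)


def dividend_factor(prev_close: float, cash: float) -> float:
    """NSE-style: cash dividend relative to the prior session close."""
    return (prev_close - cash) / prev_close


def dividend_factor_ex_date(close_on_ex_date: float, cash: float) -> float:
    """The alternative form described in the text, using the ex-date close."""
    return 1 - cash / close_on_ex_date


# Three illustrative events, all with ex-dates after the row being adjusted.
factors = [
    split_factor(10, 1),                            # 0.1
    bonus_factor(1, 1),                             # 0.5
    dividend_factor(prev_close=1000.0, cash=20.0),  # 0.98
]
adj_factor_cumulative = math.prod(factors)  # about 0.049
adj_close = 5000.0 * adj_factor_cumulative  # a raw close of 5000 becomes about 245

# Per-dividend gap between the two conventions, assuming (for illustration)
# that the stock closes at 980 on the ex-date.
gap = dividend_factor(1000.0, 20.0) - dividend_factor_ex_date(980.0, 20.0)
print(adj_factor_cumulative, adj_close, gap)
```

With these numbers the per-event gap between the two dividend conventions is a few hundredths of a percent, which is why it only becomes visible after many dividends have compounded.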
How the back-adjust is implemented
Naive approach: for each row, walk forward through actions, multiply. Quadratic in actions; fine for one symbol, painful across 4,500 symbols × 600 trading days. What actually runs:
- Partition by ISIN with polars DataFrame.partition_by. Adjustments are per-instrument.
- Sort actions descending by ex_date per ISIN, then compute the reverse cumulative product of factors.
- Use numpy searchsorted against the action ex-dates to assign each price row its cumulative factor in O(log n) per row.
End-to-end re-adjust of all years for both exchanges (the cron default when a future action lands) runs in roughly 18 seconds on a laptop. The bottleneck is parquet read, not the math.
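The steps above can be sketched for a single ISIN as follows. This is a minimal illustration under stated assumptions, not the pipeline code: `cumulative_factors` is a hypothetical helper, the dates and factors are made up, and it sorts ex-dates ascending and builds a suffix product, which is equivalent to sorting descending and taking a running product.

```python
import numpy as np


def cumulative_factors(price_dates, ex_dates, factors):
    """For each price date, return the product of all factors with ex_date > date."""
    price_dates = np.asarray(price_dates, dtype="datetime64[D]")
    ex_dates = np.asarray(ex_dates, dtype="datetime64[D]")
    factors = np.asarray(factors, dtype=np.float64)

    # Sort actions by ex-date so searchsorted can binary-search them.
    order = np.argsort(ex_dates, kind="stable")
    ex_sorted = ex_dates[order]
    f_sorted = factors[order]

    # suffix[i] is the product of f_sorted[i:]; suffix[-1] is 1.0, meaning
    # "no future actions", so the most recent rows are left unadjusted.
    suffix = np.ones(len(f_sorted) + 1)
    suffix[:-1] = np.cumprod(f_sorted[::-1])[::-1]

    # side="right" counts actions with ex_date <= date, so a row dated on the
    # ex-date itself is not scaled by that action.
    idx = np.searchsorted(ex_sorted, price_dates, side="right")
    return suffix[idx]


# Illustrative data: a 1:1 bonus and a dividend for one made-up instrument.
dates = ["2024-01-01", "2024-06-01", "2024-06-03", "2024-12-31"]
close = np.array([200.0, 210.0, 105.0, 120.0])
adj_factor_cumulative = cumulative_factors(
    dates, ex_dates=["2024-06-03", "2024-09-10"], factors=[0.5, 0.98]
)
adj_close = close * adj_factor_cumulative
print(adj_close)
```

The lookup is vectorized across all price rows of the ISIN, so the per-row cost is one binary search over that instrument's actions.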
The bug that ate two weeks: ISIN drift
We assumed ISIN was a stable join key. It mostly is. The exceptions are the names that move the index: HDFCBANK, KOTAKBANK, BAJFINANCE, SHRIRAMFIN. Each of these went through a merger or restructure in the coverage window, and NSE's corporate-action feed continues to tag actions to the legacy ISIN for months after the bhavcopy itself has switched to the new ISIN.
First symptom: HDFCBANK adjusted close diverged from Yahoo by ~7% over a 4-month window. The split factor was correct, the dividend formula was correct, but actions on the legacy HDFC ISIN were never getting applied to HDFCBANK rows because the join key didn't match.
Fix: a per-ISIN symbol-history table built from the price series itself. For every (ISIN, symbol) pair we record the contiguous interval it traded as, then resolve every action's ISIN forward through merger chains before joining. The resolver doubles as a public dataset (symbol_history/), because anyone backtesting on Indian equities hits this problem eventually.
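One way to picture the symbol-history table is the sketch below, which derives first-seen and last-seen dates for each (ISIN, symbol) pair from price rows and then answers "which ISIN traded under this symbol on this date". The names `Interval`, `build_symbol_history` and `isin_trading_on`, and the placeholder identifiers, are hypothetical. It also simplifies by assuming a pair never stops trading and later resumes, so one span per pair is enough.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Interval:
    isin: str
    symbol: str
    first_seen: date
    last_seen: date


def build_symbol_history(rows):
    """rows: iterable of (trade_date, isin, symbol) taken from the price series."""
    spans = {}
    for trade_date, isin, symbol in rows:
        key = (isin, symbol)
        first, last = spans.get(key, (trade_date, trade_date))
        spans[key] = (min(first, trade_date), max(last, trade_date))
    intervals = (
        Interval(isin, symbol, first, last)
        for (isin, symbol), (first, last) in spans.items()
    )
    return sorted(intervals, key=lambda iv: (iv.symbol, iv.first_seen))


def isin_trading_on(history, symbol, on_date):
    """Return the ISIN whose interval for this symbol covers on_date, else None."""
    for iv in history:
        if iv.symbol == symbol and iv.first_seen <= on_date <= iv.last_seen:
            return iv.isin
    return None


# Placeholder identifiers: the symbol keeps trading while its ISIN changes.
rows = [
    (date(2024, 1, 1), "ISIN_OLD", "ACME"),
    (date(2024, 1, 2), "ISIN_OLD", "ACME"),
    (date(2024, 1, 3), "ISIN_NEW", "ACME"),
    (date(2024, 1, 4), "ISIN_NEW", "ACME"),
]
history = build_symbol_history(rows)
print(isin_trading_on(history, "ACME", date(2024, 1, 4)))
```

A linear scan is fine for a sketch; a real table would more likely be joined or indexed by symbol.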
# Pseudocode for the resolver
from datetime import date

# SymbolHistory stands for the lookup layer over symbol_history/; the
# annotation is quoted so the sketch does not need the class to be defined.
def resolve_isin(action_isin: str, ex_date: date, history: "SymbolHistory") -> str:
    """
    Map an action's ISIN (which may be stale post-merger) to the
    ISIN actually trading on ex_date for the same underlying symbol.
    """
    symbol = history.lookup_symbol(isin=action_isin, on_date=ex_date)
    if symbol is None:
        return action_isin  # no history; trust the source
    return history.lookup_current_isin(symbol=symbol, on_date=ex_date)
Reconciliation: 89% within ±1% of Yahoo
Adjustment is only believable if you can show it next to a known-good source. We compared the top 50 NSE names by mean daily turnover against Yahoo's Adj Close over the full coverage window (2024-01-01 to 2026-05-06). 48 of them had usable Yahoo data, giving ~25,000 daily comparisons in total.
- 89% of comparisons within ±1%. The headline number.
- Splits and bonuses match within ±1% on the day after the event for every symbol we checked. The remaining gap is dividends, not split math.
- The systematic gap is methodology, not a bug. NSE uses (prev_close - cash) / prev_close; Yahoo's CRSP-style formula uses 1 - cash / close_on_ex_date. The two are identical to within rounding except when the dividend is large relative to close, and the gap compounds over multiple years for dividend-heavy names like INFY, TCS, HINDUNILVR.
Both formulas are defensible. We picked NSE's because (a) it matches what the exchange publishes for its own indices, and (b) it's recoverable from public data we already ingest. Yahoo's requires the ex-date close, which is itself adjusted by the dividend, creating a chicken-and-egg with the cumulative factor.
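For readers who want to check a headline number like this on their own data, here is a minimal sketch of a within-tolerance share. It assumes the deviation is measured relative to the reference (Yahoo) value and that rows with missing data on either side are skipped. The post does not spell out either choice, and `share_within_tolerance` is a hypothetical helper, not the reconciliation script.

```python
def share_within_tolerance(ours, reference, tolerance_pct=1.0):
    """Fraction of comparable rows where ours is within tolerance_pct of reference."""
    # Skip rows where either side is missing or the reference is zero.
    pairs = [(a, b) for a, b in zip(ours, reference) if a is not None and b]
    if not pairs:
        return float("nan")
    hits = sum(1 for a, b in pairs if abs(a / b - 1.0) * 100.0 <= tolerance_pct)
    return hits / len(pairs)


# Illustrative values only: three comparable rows, one row with no reference.
ours = [100.0, 101.5, 99.2, 50.0]
reference = [100.0, 100.0, 100.0, None]
share = share_within_tolerance(ours, reference, tolerance_pct=1.0)
print(f"{share:.0%} of comparisons within tolerance")
```

Reporting the count of comparable rows alongside the share keeps a number like this honest, since skipped rows shrink the denominator.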
Reproducing the sweep is one CLI command. We keep yfinance out of the pipeline package (it drags in pandas, which we deliberately avoid in core ingest) and install it via an optional extra:
pip install -e ".[reconcile]"
python scripts/reconcile_yahoo_sweep.py \
    --top 50 \
    --from 2024-01-01 \
    --to 2026-05-06 \
    --tolerance 1.0
What we decided on the three open questions
The RFC left three questions open. The answers, briefly:
- Adjust backward, not forward. Today's close is today's close; history scales to match. Matches Bloomberg/Refinitiv convention and what every Python backtester expects.
- Dividends gross of DDT. Match the announced cash amount. Pre-2020 DDT-era differences are documented in metadata but not back-applied.
- Mergers: continue history under the surviving symbol, with the symbol-history table preserving the lineage. Both sides of a merger remain queryable by ISIN through their last trading day; the surviving entity's adjusted series picks up from the merger date forward.
How to use it today
import polars as pl
from huggingface_hub import hf_hub_download

# Pull the full adjusted-prices file for NSE 2025.
p = hf_hub_download(
    "tejhq/indian-markets",
    "prices_adjusted/NSE_2025.parquet",
    repo_type="dataset",
)
df = pl.read_parquet(p)

# Continuous through HDFCBANK's merger, splits, bonuses, dividends.
df.filter(pl.col("symbol") == "HDFCBANK").select(
    "date", "close", "adj_close", "adj_factor_cumulative"
).tail(10)
Or join actions and adjusted in DuckDB to plot drawdowns alongside the events that caused them:
SELECT p.date, p.symbol, p.adj_close,
       a.type, a.cash_amount, a.ratio_num, a.ratio_den
FROM read_parquet(
    'hf://datasets/tejhq/indian-markets/prices_adjusted/NSE_*.parquet'
) p
LEFT JOIN read_parquet(
    'hf://datasets/tejhq/indian-markets/actions/NSE_*.parquet'
) a
  ON a.isin = p.isin
 AND a.ex_date = p.date
WHERE p.symbol = 'RELIANCE'
  AND p.date >= '2024-01-01'
ORDER BY p.date;
What's next
Phase 5 (derived metrics: returns, 52w hi/lo, rolling volume) shipped alongside Phase 4 and rides on the adjusted close. The next real piece of work is Phase 3.5, pre-2024 historical coverage. The legacy NSE and BSE bhavcopy formats are different files with different columns and a different cadence; once that lands, the adjustment engine extends backward unchanged.
After that, fundamentals. ISIN-keyed quarterly filings joined to the same five trees. That's where TejHQ becomes more than a price feed.
If you find a divergence
The reconciliation script is in the repo. If you find an adjusted close that disagrees with NSE's own historical chart by more than the documented Yahoo methodology gap, please open an issue on github.com/tejhq/tej-bazaar with the symbol and date. We treat divergences as production bugs.