Can domain pretraining on historical legal corpora reduce era sensitivity?
This explores whether training a model specifically on old legal texts could fix the way LLMs reason worse about historical court cases than modern ones — and the corpus suggests the answer is more tangled than a simple 'add more old data.'
At issue is whether domain pretraining on historical legal corpora could cure "era sensitivity": the documented tendency of LLMs to reason worse about older cases than recent ones. The starting point is the diagnosis itself. A Supreme Court overruling benchmark shows models systematically underperform on historical precedent, and the named root cause is that training corpora over-represent recent cases, leaving shallow representations of older law (see: Why do language models struggle with historical legal cases?). Read that way, the question almost answers itself: if the gap comes from missing historical data, feeding the model historical data should help. But the corpus has a lot to say about why that intuition is only half right.
The most direct support comes from a hard-ceiling argument: prompt-level tricks cannot inject knowledge a model never absorbed; they can only reactivate what's already in the training distribution (see: Can prompt optimization teach models knowledge they lack?). That rules out the cheap alternative. You genuinely can't prompt your way out of era sensitivity, so the fix has to happen upstream, in pretraining or fine-tuning, exactly as the question proposes. Domain pretraining is therefore the right *layer* to intervene at, not a band-aid.
The catch is that domain adaptation is not free. Research across adaptation methods finds every technique has a "domain-conditional sweet spot" with hidden costs: visible gains in one area often come paired with quiet degradation in reasoning faithfulness, capability transfer, and format flexibility (see: How do domain training techniques actually reshape model behavior?). So pretraining on historical legal text might lift historical-case accuracy while subtly eroding how the model handles modern cases or transfers reasoning across eras. The era gap could narrow on the benchmark while the model's overall legal reasoning gets more brittle, a trade the headline number wouldn't reveal.
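One way to keep that trade visible is to report accuracy per era before and after adaptation, instead of a single gap number. The sketch below is illustrative only: the record format, the `era_report` helper, the era labels, and the toy counts are all assumptions invented for the example, not the benchmark's actual schema or results.

```python
from collections import defaultdict


def accuracy_by_era(records):
    """Compute accuracy per era from (era, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for era, is_correct in records:
        totals[era] += 1
        hits[era] += int(is_correct)
    return {era: hits[era] / totals[era] for era in totals}


def era_report(base_records, adapted_records, old="historical", new="modern"):
    """Compare a base and an adapted model on the same era-labelled items.

    Reports the era gap (modern minus historical accuracy) before and after
    adaptation, plus the per-era change, so a shrinking gap that is paid for
    with lost modern-case accuracy shows up explicitly.
    """
    base = accuracy_by_era(base_records)
    adapted = accuracy_by_era(adapted_records)
    return {
        "gap_before": base[new] - base[old],
        "gap_after": adapted[new] - adapted[old],
        "historical_delta": adapted[old] - base[old],
        "modern_delta": adapted[new] - base[new],
    }


# Toy, made-up outcomes (NOT real benchmark results): 10 items per era.
base_records = (
    [("historical", True)] * 5 + [("historical", False)] * 5
    + [("modern", True)] * 9 + [("modern", False)] * 1
)
adapted_records = (
    [("historical", True)] * 7 + [("historical", False)] * 3
    + [("modern", True)] * 8 + [("modern", False)] * 2
)

report = era_report(base_records, adapted_records)
for name, value in report.items():
    print(f"{name}: {value:+.2f}")
```

In this toy case the gap shrinks, but part of the shrinkage comes from a negative `modern_delta`. That is the pattern a single headline gap would hide. A fuller check would also track reasoning faithfulness and format flexibility, which this sketch does not measure.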
There's also a sharper lesson about *how* to add the historical knowledge. The strongest result on deep domain expertise didn't come from dumping in more raw text. It came from composing structured reasoning tasks out of a knowledge graph, where curated compositional paths beat scale (see: Can knowledge graphs teach models deep domain expertise?). For law, where precedent is inherently a graph of citations and overrulings, that hints the cure for era sensitivity may be less about the volume of old cases and more about teaching the *relationships* between old and new doctrine. And if your historical corpus is itself degraded by OCR errors, archaic language drift, and the other realities of old documents, work on noisy historical newspapers shows that aggressive retrieval paired with grounded refusal preserves integrity better than trusting the raw text (see: Can RAG systems refuse to answer without reliable evidence?).
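To make the "relationships over volume" idea concrete, here is a minimal sketch of composing multi-hop training tasks from a precedent graph. It is a hypothetical transfer of the knowledge-graph-path idea to law, not the method from the cited medical work: the edge format, the placeholder case names, the function names, and the question template are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Index (source_case, relation, target_case) triples by source case."""
    graph = defaultdict(list)
    for src, relation, dst in edges:
        graph[src].append((relation, dst))
    return graph


def enumerate_paths(graph, min_hops=2, max_hops=3):
    """Enumerate simple (cycle-free) paths with min_hops..max_hops edges."""
    paths = []

    def walk(node, path, visited):
        if min_hops <= len(path) <= max_hops:
            paths.append(list(path))
        if len(path) == max_hops:
            return
        for relation, nxt in graph.get(node, []):
            if nxt in visited:
                continue  # skip cycles so each case appears once per path
            path.append((node, relation, nxt))
            walk(nxt, path, visited | {nxt})
            path.pop()

    for start in list(graph):
        walk(start, [], {start})
    return paths


def path_to_task(path):
    """Turn one path into a question plus the reasoning chain it requires."""
    start, end = path[0][0], path[-1][2]
    chain = " -> ".join(f"{s} {r} {d}" for s, r, d in path)
    return {
        "question": f"How is {start} related to {end} through intervening precedent?",
        "reasoning_chain": chain,
        "hops": len(path),
    }


# Placeholder cases and relations, invented for the example.
edges = [
    ("Case C (1990)", "overrules", "Case B (1940)"),
    ("Case B (1940)", "overrules", "Case A (1895)"),
    ("Case D (2005)", "cites", "Case C (1990)"),
]

tasks = [path_to_task(p) for p in enumerate_paths(build_graph(edges))]
for task in tasks:
    print(task["hops"], task["reasoning_chain"])
```

Each generated task forces a chain that links a recent case back to an older one, so historical precedent enters training as part of a doctrinal relationship instead of as isolated text. Real use would need curated edges and human-checked answers; the template here only shows the shape of the data.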
So: yes, domain pretraining targets the real cause and prompting can't substitute for it — but the corpus reframes the goal. The win isn't "more historical tokens," it's structured, relationship-aware exposure that closes the era gap without quietly trading away the modern-case competence and reasoning faithfulness you started with.
Sources (5 notes)
Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.
A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.