What substrate do supervised models lack that makes them weaker on low-resource languages?
This reads 'substrate' as the underlying representational grounding — not more labeled examples — that models lack for low-resource languages: the corpus suggests the missing thing is dense, first-class internal representation, which supervised data quantity can't manufacture.
This explores what's actually missing when models stumble on low-resource languages, and the corpus points away from the obvious answer ('not enough training data') toward something more structural: low-resource languages and cultures never get their own representational substrate inside the model. The most direct evidence comes from mechanistic interpretability showing that low-resource cultures like Ethiopia and Algeria are internally routed *through* high-resource proxies. The model doesn't represent them directly; it represents them as a distortion of dominant cultures, and this persists even when the surface output looks correct (see "Do LLMs represent low-resource cultures through dominant cultural proxies?"). The substrate isn't merely thin; for the low-resource case it's borrowed. That's a different failure than a knowledge gap: it's a representational one.
The same pattern shows up in a domain that has nothing to do with language per se. On historical legal cases, models do worse not because the cases are harder but because the training corpus over-represents recent material, leaving 'shallower representations' of older precedent (see "Why do language models struggle with historical legal cases?"). Swap 'historical era' for 'low-resource language' and you plausibly have the same mechanism: whatever is under-represented in the corpus gets a thin, shallow internal encoding, and the model falls back on whatever dense representation is nearest. Under-representation doesn't just lower accuracy; it changes *how* the thing is stored.
This connects to a deeper finding about what supervised models learn in the first place. Models routinely pass grammatical tests by leaning on surface cues (sentence length, word choice, orthography) rather than the underlying structural rules (see "Can models pass tests while missing the actual grammar?"), and their competence degrades predictably as structural complexity rises (see "Does LLM grammatical performance decline with structural complexity?" and "Why do large language models fail at complex linguistic tasks?"). For a high-resource language, surface heuristics are abundant enough to paper over the missing deep structure. For a low-resource language, there simply aren't enough surface patterns to lean on, so the absence of genuine structural grounding gets exposed. The substrate that's lacking is the same one that's lacking everywhere; it's just visible here.
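One way to make the surface-cue concern concrete is to score a grammaticality benchmark with a deliberately structure-blind heuristic and see how far it gets. The sketch below is a minimal illustration, not the evaluation used in the cited notes: the names `heuristic_accuracy` and `prefer_shorter` and the three toy minimal pairs are hypothetical, invented here for the example.

```python
from typing import Callable, Sequence, Tuple

# A minimal pair: (grammatical sentence, ungrammatical sentence).
Pair = Tuple[str, str]


def heuristic_accuracy(pairs: Sequence[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the heuristic strictly prefers the grammatical sentence."""
    if not pairs:
        raise ValueError("need at least one minimal pair")
    wins = sum(1 for good, bad in pairs if score(good) > score(bad))
    return wins / len(pairs)


def prefer_shorter(sentence: str) -> float:
    """Structure-blind heuristic: fewer whitespace-separated tokens scores higher."""
    return -float(len(sentence.split()))


# Hypothetical toy data. The first two pairs leak the answer through length;
# the third is length-matched, so only agreement structure separates them.
toy_pairs = [
    ("The dogs bark.", "The dogs barks loudly today."),
    ("She has left.", "She have left the the room."),
    ("The keys to the cabinet are here.", "The keys to the cabinet is here."),
]

length_only_accuracy = heuristic_accuracy(toy_pairs, prefer_shorter)
```

On this toy set the length heuristic wins the two pairs that differ in length and cannot win the length-matched one. A benchmark where such a heuristic scores well cannot, on its own, tell structural competence apart from surface pattern matching.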
There's also a clue about *why supervised training specifically* hits this wall. When small models are trained with plain supervised fine-tuning, they underperform on rigid, low-frequency output patterns, and adding explicit negative examples (DPO) is what fixes it, because SFT alone only ever reinforces the dense, high-frequency patterns it already sees most often (see "Can small models match large models on function calling?"). Supervised learning is biased toward whatever is abundant. And once a strong parametric prior exists, models will override contradicting in-context evidence with it (see "Why do language models ignore information in their context?"), meaning even feeding a low-resource language at inference time doesn't reliably correct a model whose internal substrate was built around the high-resource majority.
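The difference between the two training signals can be sketched numerically. The block below uses the standard published form of the DPO objective on scalar sequence log-probabilities; the function names, the example log-probability values, and `beta=0.1` are assumptions made for illustration, not details taken from the cited note.

```python
import math


def sft_loss(logp_chosen: float) -> float:
    """SFT signal: negative log-likelihood of the correct output only."""
    return -logp_chosen


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value, chosen only for the example
) -> float:
    """DPO signal: -log sigmoid(beta * margin), where the margin compares how far
    the policy has moved toward the chosen output versus the rejected one,
    each measured relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Same likelihood for the correct output, different likelihood for the wrong one.
wrong_is_likely = dpo_loss(-1.0, -0.5, -1.0, -1.0)
wrong_is_unlikely = dpo_loss(-1.0, -2.0, -1.0, -1.0)
```

In this sketch `sft_loss` takes no argument for the rejected output, so it returns the same value in both cases, while `dpo_loss` is larger when the incorrect output is more likely. That is the sense in which explicit negatives supply a signal that likelihood on positives alone does not.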
The thing you didn't know you wanted to know: the corpus reframes 'low-resource' as a representational state, not a data-quantity state. The implication across these notes isn't 'add more examples'. It's that statistical, surface-pattern learning never builds the structural substrate in the first place, and low-resource languages are simply where that absence can no longer be hidden.
Sources (7 notes)
Mechanistic interpretability analysis reveals that low-resource cultures like Ethiopia and Algeria are structurally represented through high-resource cultural proxies in internal model states, not just output. This architectural bias persists even when models can produce correct surface-level answers.
Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.
BabyLM evaluations showed models can produce correct outputs by relying on sentence length, word choice, and orthography rather than grammatical structure. Standard benchmarks cannot distinguish these two generalization types without tests specifically designed to rule out surface heuristics.
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.