INQUIRING LINE

How do corpus statistics shape the abstraction hierarchy in language model representations?

This line of inquiry asks where the layered, nested structure in language model representations actually comes from. The note collection's sharpest answer is that it falls out of the raw statistics of which words appear near which, with no special machinery required.


The question assumes that models build an abstraction hierarchy. The note collection's most direct answer is that they inherit it from the math of word co-occurrence more than they build it. The standout finding is that hierarchical concept geometry needs no dedicated mechanism: it emerges as a mathematical consequence of corpus statistics. A spectral analysis of which words appear together predicts and reproduces the same nested geometry found inside trained embeddings and even older word2vec models (see "Where does hierarchical structure in language models come from?"). In other words, the shape of the data writes the shape of the representation.
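As a minimal sketch of the idea, the following builds a small co-occurrence matrix and embeds each word on its leading eigenvectors. Everything here is an assumption for illustration: the eight-word taxonomy, the counts (6, 2, 1), and the use of plain power iteration on raw counts. It is not the method of the cited note, only a toy instance of spectral analysis on co-occurrence data.

```python
import math
import random
from itertools import combinations

# Hypothetical toy taxonomy: word -> (category, sub-group).
TAXONOMY = {
    "cat": ("animal", "pet"), "dog": ("animal", "pet"),
    "cow": ("animal", "farm"), "pig": ("animal", "farm"),
    "car": ("vehicle", "road"), "bus": ("vehicle", "road"),
    "jet": ("vehicle", "air"), "glider": ("vehicle", "air"),
}
WORDS = list(TAXONOMY)


def toy_cooccurrence(words):
    """Assumed counts: 6 within a sub-group, 2 within a category, 1 otherwise."""
    n = len(words)
    counts = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        (cat_i, sub_i), (cat_j, sub_j) = TAXONOMY[words[i]], TAXONOMY[words[j]]
        if (cat_i, sub_i) == (cat_j, sub_j):
            c = 6.0
        elif cat_i == cat_j:
            c = 2.0
        else:
            c = 1.0
        counts[i][j] = counts[j][i] = c
    return counts


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(matrix, vec):
    return [dot(row, vec) for row in matrix]


def top_eigenpairs(matrix, k, iters=500, seed=0):
    """Power iteration with deflation on a symmetric matrix.

    A diagonal shift (a Gershgorin bound) makes every eigenvalue of the
    shifted matrix non-negative, so the iteration finds the algebraically
    largest eigenvalues rather than the largest in magnitude.
    """
    shift = max(sum(abs(x) for x in row) for row in matrix)
    rng = random.Random(seed)  # fixed seed so the start vectors are reproducible
    pairs = []
    for _component in range(k):
        v = [rng.uniform(-1.0, 1.0) for _row in matrix]
        for _step in range(iters):
            for _lam, u in pairs:  # project out eigenvectors already found
                proj = dot(v, u)
                v = [a - proj * b for a, b in zip(v, u)]
            v = [w + shift * a for w, a in zip(matvec(matrix, v), v)]
            norm = math.sqrt(dot(v, v))
            v = [a / norm for a in v]
        pairs.append((dot(v, matvec(matrix, v)), v))  # Rayleigh quotient
    return pairs


def embed(matrix, k):
    """Coordinates of each row on the top-k eigenvectors, scaled by sqrt(eigenvalue)."""
    pairs = top_eigenpairs(matrix, k)
    return [[math.sqrt(max(lam, 0.0)) * vec[i] for lam, vec in pairs]
            for i in range(len(matrix))]


def distance(embedding, a, b):
    return math.dist(embedding[WORDS.index(a)], embedding[WORDS.index(b)])


EMBEDDING = embed(toy_cooccurrence(WORDS), k=4)
for other in ("dog", "cow", "car"):
    print(f"cat-{other}: {distance(EMBEDDING, 'cat', other):.3f}")
```

In this toy setup the distances from "cat" grow in taxonomy order: smallest to "dog" (same sub-group), larger to "cow" (same category), largest to "car" (different category). The nesting was never encoded as a rule. It lives only in the counts, and the eigenvectors recover it.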

If statistics sculpt the hierarchy, then statistical imbalance warps it, and several notes in the collection show exactly that. Models build shallower, weaker representations of whatever is under-represented in training. Historical legal cases get systematically worse treatment than modern ones because recent cases dominate the training corpus (see "Why do language models struggle with historical legal cases?"), and low-resource cultures get routed through high-resource cultural proxies as a structural pathway inside the model, not just a surface slip (see "Do LLMs represent low-resource cultures through dominant cultural proxies?"). The abstraction hierarchy isn't a neutral ladder; its rungs are spaced by how often the data talks about something.

The same statistical origin explains a ceiling on what the hierarchy can hold. Because representations track co-occurrence rather than rules, models capture surface patterns but miss deep grammatical structure: they reliably stumble on embedded clauses and complex nominals, and the errors get predictably worse as syntactic depth increases (see "Why do large language models fail at complex linguistic tasks?"). Push further and reasoning itself turns out to ride on semantic association rather than symbolic structure: strip the familiar meaning out of a task and performance collapses even when the rules are sitting right there in context (see "Do large language models reason symbolically or semantically?"). A hierarchy assembled from word statistics is excellent at the statistically frequent and brittle at the structurally deep.

There's a useful tension here worth chasing. One line of work suggests the hierarchy can be climbed deliberately rather than just absorbed. Deep-and-thin architectures beat more balanced, wider designs at small scale precisely by composing abstract concepts across layers (see "Does depth matter more than width for tiny language models?"), and chain-of-thought reasoning lets a model construct genuine syntactic trees and phonological generalizations it can't produce in a single pass (see "Can language models actually analyze language structure?"). So the static geometry handed to you by corpus statistics is one thing; what extra depth or explicit reasoning steps can build on top of it is another.
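To make "deep-and-thin versus wide at the same scale" concrete, here is a rough parameter-count sketch. It assumes a generic decoder block with four attention projection matrices and a two-matrix MLP at 4x expansion, and it ignores embeddings, biases, and norms. The layer counts and widths are hypothetical and are not taken from MobileLLM.

```python
def approx_block_params(num_layers, d_model, mlp_ratio=4):
    """Rough non-embedding parameter count for a generic decoder stack.

    Per layer: 4 * d^2 for the Q, K, V, and output projections, plus
    2 * mlp_ratio * d^2 for a two-matrix MLP. Biases and norms are ignored.
    """
    per_layer = 4 * d_model**2 + 2 * mlp_ratio * d_model**2
    return num_layers * per_layer


# Two hypothetical shapes with nearly the same parameter budget.
deep_thin = approx_block_params(num_layers=30, d_model=512)
wide_shallow = approx_block_params(num_layers=12, d_model=810)

print(f"deep-and-thin:    {deep_thin:,} parameters")
print(f"wide-and-shallow: {wide_shallow:,} parameters")
```

Under these assumptions both shapes land near 94 million parameters, so any accuracy difference between them would come from how the budget is arranged, with the deeper stack offering more sequential steps in which to compose concepts.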

The quiet payoff: the abstraction hierarchy you can interrogate inside a model is largely a fossil of its training distribution. That reframes a lot of failures as shadows cast by the statistics that built the representations in the first place, rather than as bugs in the reasoning. Two examples are context being overridden by strong priors (see "Why do language models ignore information in their context?") and low-probability tasks like reversing the alphabet being hard for reasons that have nothing to do with logical difficulty (see "Can we predict where language models will fail?").
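The reversed-alphabet case can be illustrated with a deliberately tiny stand-in for an autoregressive model. The sketch below assumes a character bigram model with add-one smoothing, trained on a made-up corpus that only ever contains the alphabet in forward order. It is not how any cited experiment was run, only a way to see why a logically simple target can still be a low-probability one.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Assumed toy training corpus: the forward alphabet, repeated.
TRAINING_TEXT = (ALPHABET + " ") * 50


def train_bigram(text):
    """Count character bigrams and how often each character starts one."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    vocab = sorted(set(text))
    return pair_counts, first_counts, vocab


def log_prob(sequence, model, alpha=1.0):
    """Sum of log P(next | previous) with add-alpha smoothing."""
    pair_counts, first_counts, vocab = model
    total = 0.0
    for prev, nxt in zip(sequence, sequence[1:]):
        numerator = pair_counts[(prev, nxt)] + alpha
        denominator = first_counts[prev] + alpha * len(vocab)
        total += math.log(numerator / denominator)
    return total


MODEL = train_bigram(TRAINING_TEXT)
forward = log_prob(ALPHABET, MODEL)
backward = log_prob(ALPHABET[::-1], MODEL)

print(f"log P(forward alphabet):  {forward:.2f}")
print(f"log P(backward alphabet): {backward:.2f}")
```

Both strings have the same length and the same letters, yet the backward one scores far lower because none of its transitions appear in the training text. The gap reflects the training distribution, not the logical difficulty of the task.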


Sources (9 notes)

Where does hierarchical structure in language models come from?

LLM hierarchical representations arise as a direct mathematical consequence of corpus statistics, not from hierarchy-specific mechanisms. Spectral analysis of word co-occurrence matrices predicts and reproduces the same nested geometry found in trained embeddings and word2vec models.

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Do LLMs represent low-resource cultures through dominant cultural proxies?

Mechanistic interpretability analysis reveals that low-resource cultures, such as those of Ethiopia and Algeria, are structurally represented through high-resource cultural proxies in internal model states, not just in the output. This structural bias persists even when models can produce correct surface-level answers.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can language models actually analyze language structure?

OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic analyst investigating whether corpus statistics genuinely determine the abstraction hierarchy in LLM representations, or whether newer training methods, architectural innovations, or evaluation techniques have since decoupled representation geometry from co-occurrence patterns.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints:
• Hierarchical concept geometry emerges as a mathematical consequence of word co-occurrence statistics; no dedicated mechanism needed (~2026).
• Models systematically build weaker representations of underrepresented domains: historical legal cases perform worse than modern ones, low-resource cultures routed through high-resource proxies (~2025).
• LLMs reliably fail on embedded clauses and syntactic depth; errors worsen predictably with structural complexity (~2025).
• When semantics are stripped, in-context reasoning collapses despite rules being available; models are semantic associators, not symbolic reasoners (~2023).
• Chain-of-thought and metalinguistic analysis allow models to construct syntactic structures they cannot produce in single-pass generation (~2023).

Anchor papers (verify; mind their dates):
• 2305.14825 (In-Context Semantic Reasoners, May 2023)
• 2503.19260 (Linguistic Blind Spots, March 2025)
• 2508.08879 (Entangled Representations / Cultural Biases, August 2025)
• 2605.23821 (Hierarchical Concept Geometry, May 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For corpus-driven geometry: has architectural pre-training (e.g., explicit hierarchical induction, curriculum learning, synthetic balanced corpora, or retrieval-augmented training) since decoupled learned representations from raw co-occurrence? For domain underrepresentation: do fine-tuning, in-context examples, or adaptive routing now repair hierarchy flatness? For syntactic brittleness: do newer models (o1, reasoning variants) or explicit parse-guided decoding overcome depth limits? Separate the durable claim (representation geometry tracks training distribution) from possibly-relaxed constraints (statistical imbalance, syntactic failure).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially any showing hierarchies shaped by objectives OTHER than co-occurrence, or evidence that deliberate reasoning architectures can build orthogonal abstraction orders.
(3) Propose 2 research questions assuming the regime has shifted: (a) Can contrastive or causal training objectives induce hierarchies independent of corpus statistics? (b) What is the minimal corpus intervention (rebalancing, synthetic augmentation, or auxiliary objectives) needed to flip a representation's hierarchy from statistical to task-aligned?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
