INQUIRING LINE

Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · cross-cluster

Why does natural language contain redundancy humans need but models don't?

This explores why language is full of repetition, grammatical scaffolding, and 'filler' that humans seem to rely on, yet models can strip away with little loss — and what that asymmetry reveals about how each side actually uses words.

The corpus suggests that humans and models are doing two different jobs with the same tokens: humans preserve redundancy (repetition, grammatical scaffolding, hedges, connective filler) because it carries contextual meaning for situated action, while models treat it as compressible overhead. The sharpest evidence is the finding that LLMs prioritize aggressive statistical compression while humans preserve adaptive nuance (see: Do LLMs compress concepts more aggressively than humans do?). Viewed through rate-distortion theory, models capture broad category structure efficiently but discard the fine-grained distinctions humans hold onto. Humans trade compression away for meaning, and that trade is exactly what 'redundancy' looks like from a purely information-theoretic standpoint.
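As a rough illustration of that trade-off, the toy sketch below compares a coarse and a fine grouping of the same items. Everything in it is an assumption made for illustration: the one-dimensional `points`, the two groupings, and the choice of label entropy as the 'rate' and mean squared distance to the group centroid as the 'distortion'. It is not the measure used in the cited study.

```python
import math
from collections import defaultdict

def rate_bits(assignments):
    """Entropy H(C) of the group labels, in bits.

    With equally weighted items and a deterministic assignment,
    this equals the mutual information I(X; C).
    """
    n = len(assignments)
    counts = defaultdict(int)
    for c in assignments:
        counts[c] += 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def distortion(points, assignments):
    """Mean squared distance from each point to its group centroid."""
    groups = defaultdict(list)
    for x, c in zip(points, assignments):
        groups[c].append(x)
    centroids = {c: sum(xs) / len(xs) for c, xs in groups.items()}
    total = sum((x - centroids[c]) ** 2 for x, c in zip(points, assignments))
    return total / len(points)

# Hypothetical items on a single feature axis.
points = [0.0, 1.0, 4.0, 5.0, 20.0, 21.0, 24.0, 25.0]
coarse = [0, 0, 0, 0, 1, 1, 1, 1]  # two broad categories
fine = [0, 0, 1, 1, 2, 2, 3, 3]    # four finer categories

for name, grouping in [("coarse", coarse), ("fine", fine)]:
    print(name, rate_bits(grouping), distortion(points, grouping))
```

In this toy setup the coarse grouping costs 1 bit per item at a distortion of 4.25, while the fine grouping spends 2 bits to bring distortion down to 0.25. The extra bit is 'redundant' only if the finer distinctions never matter.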

You can watch this play out token by token inside reasoning chains. When tokens are ranked by how much they matter to the model's likelihood, symbolic-computation tokens are preserved first while grammar and meta-discourse are pruned earliest, and student models trained on those pruned chains outperform ones trained on chains compressed by a frontier model (see: Which tokens in reasoning chains actually matter most?). So the connective tissue of natural language is, to a model, largely deletable scaffolding. A related result shows that models trained with hidden chain-of-thought tokens compute the correct answer in their early layers and then suppress it in the final layers to produce format-compliant filler: the fluent surface is downstream of, and separable from, the actual computation (see: Do transformers hide reasoning before producing filler tokens?).

The twist is that this 'redundancy is overhead' picture is also where models are quietly weak. They learn the surface statistics of grammar without the deep rules: top-tier models such as Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals, and their performance degrades as syntactic depth increases (see: Why do large language models fail at complex linguistic tasks?). So the grammatical redundancy a model discards as filler is partly the very structure it never fully internalized. Humans use that redundancy for error correction and disambiguation; it is how communication over a noisy channel stays robust. The model doesn't 'need' it only in the sense that it isn't using it the way we are.
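The error-correcting role of redundancy can be illustrated with a textbook repetition code, offered here as an analogy and not as a model of human language. The sketch repeats every bit three times and decodes by majority vote. The function names and the flipped positions are chosen for illustration.

```python
def encode(bits, n=3):
    """Repeat every bit n times (a simple repetition code)."""
    return [b for b in bits for _ in range(n)]

def decode(coded, n=3):
    """Majority vote over each block of n received bits."""
    return [int(sum(coded[i:i + n]) * 2 > n) for i in range(0, len(coded), n)]

def flip(bits, positions):
    """Simulate channel noise by flipping the bits at the given indices."""
    return [b ^ 1 if i in positions else b for i, b in enumerate(bits)]

message = [1, 0, 1, 1]

# With redundancy: one flipped bit in each of three blocks is still recoverable.
noisy_coded = flip(encode(message), {0, 4, 8})
recovered = decode(noisy_coded)

# Without redundancy: a single flipped bit silently changes the message.
noisy_raw = flip(message, {1})

print(recovered == message)   # redundancy absorbed the noise
print(noisy_raw == message)   # no margin, so the error goes through
```

The coded message is three times longer and carries no extra content, yet it is the only version that survives the noise.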

There's a deeper framing here worth knowing about: a model is an autoregressive probability machine, and it fails predictably on tasks whose target output is low-probability, even when the task is logically trivial (see: Can we predict where language models will fail?). Redundancy in human language is, in part, low-information material: high-probability, predictable, and therefore cheap for a next-token predictor either to generate fluently or to drop entirely without changing the high-probability core. What's effortful for humans (precision) and what's effortful for models (low-probability targets) lie on different axes, which is why the same redundant phrase can be load-bearing for one and disposable for the other.
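The link between predictability and information content can be made concrete with surprisal, defined as -log2(p) bits for a token the model assigns probability p. The probabilities below are hypothetical values picked as powers of two for readability, and the `filler` set is an assumption made for the example.

```python
import math

def surprisal_bits(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

# Hypothetical next-token probabilities for the phrase "in order to compute 391".
token_probs = [("in", 0.5), ("order", 0.5), ("to", 0.5),
               ("compute", 0.125), ("391", 1 / 1024)]
filler = {"in", "order", "to"}  # assumed to be the redundant scaffolding

bits = {tok: surprisal_bits(p) for tok, p in token_probs}
total_bits = sum(bits.values())
filler_bits = sum(b for tok, b in bits.items() if tok in filler)

filler_token_share = len(filler) / len(token_probs)
filler_bit_share = filler_bits / total_bits

print(filler_token_share)  # fraction of the tokens that are filler
print(filler_bit_share)    # fraction of the information they carry
```

Under these assumed numbers the filler accounts for 60% of the tokens but under 19% of the bits. That asymmetry is what makes it cheap for a next-token predictor to generate or to drop.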

The thing you didn't know you wanted to know: the redundancy gap isn't really about language at all. It's about what compression is *for*. Models optimize for statistical efficiency because that's their objective; humans 'waste' bits on redundancy because those bits do contextual, social, and error-correcting work that only matters when language has to survive contact with a messy world. Strip the redundancy and a model is often fine; strip it from human communication and you lose the margin that lets meaning hold up under noise (see: Do LLMs compress concepts more aggressively than humans do?).

Sources (5 notes)

Do LLMs compress concepts more aggressively than humans do?

Using Rate-Distortion Theory on cognitive datasets, LLMs capture broad category structure but lose fine-grained distinctions humans preserve. LLMs maximize compression efficiency; humans trade compression for contextual meaning that enables situated action.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
