Why does the gap between theoretical expressiveness and learned capability matter?
This note explores the recurring gap between what a model could in principle represent or do (its theoretical capacity) and what it actually demonstrates after training, and why that gap is the thing worth watching rather than the headline benchmark number.
Put another way, there is a difference between what a model is *capable* of representing and what it has actually *learned to do*, and that distinction quietly decides what your evaluations are really measuring. The corpus keeps circling one uncomfortable finding: the capability is often already there, and training mostly decides whether it gets used. Multiple independent methods (RL steering, decoding tricks, feature steering) all elicit reasoning that base models already hold latently (Do base models already contain hidden reasoning ability?), and a closely related line argues RL post-training teaches a model *when* to reason rather than *how*, recovering most of the gains just by routing tokens (Does RL post-training create reasoning or just deploy it?). If that's true, then a benchmark jump after fine-tuning may be measuring elicitation, not new ability, and you'll misjudge what your model can do off-distribution.
The gap matters because most of our measurement tools are blind to it. 'Emergent abilities' that look like sudden capability jumps often dissolve into smooth, predictable curves the moment you switch from a discontinuous metric to a continuous one: the leap was in the ruler, not the model (Are LLM emergent abilities real or measurement artifacts?). Worse, two models with identical accuracy can be organized completely differently inside: one can carry all the linearly decodable features it needs while its internal structure is fractured, leaving it fragile to perturbation and distribution shift in ways no standard score reveals (Can models be smart without organized internal structure?). Same number, very different thing learned.
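A toy calculation makes the ruler effect concrete. The sketch below assumes, purely for illustration, that per-token accuracy improves smoothly with scale and that an answer counts as correct only when all `answer_len` tokens are right (with token errors treated as independent). The function names and all numbers are hypothetical and are not taken from the cited notes.

```python
def per_token_accuracy(scale_step: int, n_steps: int) -> float:
    """Hypothetical smooth improvement: linear from 0.50 to 0.99 across scale steps."""
    return 0.50 + 0.49 * scale_step / (n_steps - 1)


def exact_match(p_token: float, answer_len: int) -> float:
    """All-or-nothing metric: every token must be right (independence assumed)."""
    return p_token ** answer_len


def curves(n_steps: int = 8, answer_len: int = 20):
    """Return (step, continuous metric, discontinuous metric) for each scale step."""
    rows = []
    for step in range(n_steps):
        p = per_token_accuracy(step, n_steps)
        rows.append((step, p, exact_match(p, answer_len)))
    return rows


if __name__ == "__main__":
    for step, p, em in curves():
        print(f"scale step {step}: per-token={p:.3f}  exact-match={em:.4f}")
```

Under these assumptions the per-token column rises by the same amount at every step, while the exact-match column sits near zero for most of the range and then climbs steeply at the end. The underlying model improvement is identical in both columns; only the metric differs.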
This is also why training can capture the *form* of a skill without the substance. Imitating ChatGPT reliably fools human evaluators by reproducing a confident, fluent style while closing none of the actual factuality gap (Can imitating ChatGPT fool evaluators into thinking models improved?). Instruction tuning turns out to teach the output-space distribution more than task understanding: models trained on semantically empty or deliberately wrong instructions score about the same as those trained on correct ones (Does instruction tuning teach task understanding or output format?). And chain-of-thought exemplars that are logically *invalid* perform nearly as well as valid ones: the model learns the shape of reasoning, not inference itself (Does logical validity actually drive chain-of-thought gains?). In each case the expressive capacity to do the real thing exists; what got learned was the cheaper surface pattern.
The gap isn't just an accounting curiosity: ignoring it actively damages models. Train on problems that are too hard and the model learns degenerate shortcuts that then contaminate capabilities it already had, because rare accidental successes get reinforced as if they were sound reasoning (Do overly hard RLVR samples actually harm model capabilities?). Pure self-improvement stalls for the same family of reasons, namely the generation-verification gap and reward hacking, and only works when it smuggles in an external anchor (Can models reliably improve themselves without external feedback?). Underneath all of it sits the harder claim that LLMs track statistical regularities with high fidelity yet have measurable, structurally specific epistemic failures: the gap between pattern-tracking and genuine knowledge isn't a tuning artifact you can train away (What do language models actually know?).
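To see how a rare accidental success can end up reinforced, consider a minimal sketch that assumes rewards are normalized within a group of sampled answers (subtract the group mean, divide by the group standard deviation). This is one common form of group-relative normalization and not necessarily the exact setup of the cited note; the function name and the reward values are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical groups of 8 sampled answers with binary (verifiable) rewards.
too_hard = [0, 0, 0, 0, 0, 0, 0, 1]  # one lucky hit on a near-impossible problem
moderate = [1, 0, 1, 1, 0, 1, 0, 1]  # a problem within the model's reach

if __name__ == "__main__":
    print("lucky hit advantage:   ", max(group_relative_advantages(too_hard)))
    print("ordinary hit advantage:", max(group_relative_advantages(moderate)))
```

With these toy numbers the single lucky success receives an advantage of roughly 2.65, against roughly 0.77 for each success on the moderate problem. The normalization cannot tell a sound derivation from a guess, so whatever produced the lucky answer gets the strongest push.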
So here's the thing you might not have known you wanted to know: the most useful research direction may not be making models *more* expressive, since they often already represent more than they show, but getting better at telling elicitation apart from acquisition, and form from substance. A promising practical angle is composing latent skills at inference time rather than baking them in, as with singular-value expert vectors that mix dynamically without retraining (Can models dynamically activate expert skills at inference time?). If capability is mostly already present, the leverage shifts from *building* it to *surfacing the right part of it on demand*, and to evaluations honest enough to tell whether you did.
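The idea of mixing singular-value expert vectors can be sketched in a few lines. The example assumes a weight matrix is already available in factored form W = U diag(s) V^T, that each expert is a vector of per-singular-value scales, and that the mixing weights are simply given; how those weights are chosen at inference time is not covered here. All names and numbers are hypothetical, and this is a toy illustration, not the cited method's implementation.

```python
def matmul(a, b):
    """Plain-Python matrix product for small lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def diag(values):
    """Square matrix with `values` on the diagonal and zeros elsewhere."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def mix_experts(experts, weights):
    """Weighted sum of expert vectors; each holds one scale per singular value."""
    dim = len(next(iter(experts.values())))
    return [sum(weights[name] * experts[name][i] for name in experts) for i in range(dim)]


def adapted_weight(u, s, vt, z):
    """Rebuild W' = U diag(s * z) V^T, leaving the singular directions untouched."""
    scaled = [si * zi for si, zi in zip(s, z)]
    return matmul(matmul(u, diag(scaled)), vt)


# Toy factors, assumed to come from a decomposition computed once offline.
U = [[0.0, 1.0], [1.0, 0.0]]
S = [3.0, 1.0]
VT = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical expert vectors: one scale per singular value.
EXPERTS = {"math": [1.5, 1.0], "code": [1.0, 0.5]}

if __name__ == "__main__":
    z = mix_experts(EXPERTS, {"math": 0.75, "code": 0.25})
    print("mixed scales:", z)
    print("adapted weight:", adapted_weight(U, S, VT, z))
```

Because only the singular values are rescaled, switching or blending experts means changing a short vector per matrix rather than retraining or swapping full weights.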
Sources (11 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions: instruction tuning reaches 43%, versus 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.