INQUIRING LINE

Training might not teach AI new skills — it might just decide which ones it bothers to use.

Why does the gap between theoretical expressiveness and learned capability matter?

This explores the recurring gap between what a model could in principle represent or do (its theoretical capacity) and what it actually demonstrates after training — and why that gap is the thing worth watching rather than the headline benchmark number.

The distinction between what a model is *capable* of representing and what it has actually *learned to do* quietly decides what your evaluations are really measuring. The corpus keeps circling one uncomfortable finding: the capability is often already there, and training mostly decides whether it gets used. Multiple independent methods (RL steering, decoding tricks, feature steering) all elicit reasoning that base models already hold latently (see: Do base models already contain hidden reasoning ability?), and a closely related line argues that RL post-training teaches a model *when* to reason rather than *how*, with hybrid models recovering 91% of the gains by routing tokens alone (see: Does RL post-training create reasoning or just deploy it?). If that's true, then a benchmark jump after fine-tuning may be measuring elicitation, not new ability, and you'll misjudge what your model can do off-distribution.

The gap matters because most of our measurement tools are blind to it. 'Emergent abilities' that look like sudden capability jumps often dissolve into smooth, predictable curves the moment you switch from a discontinuous metric to a continuous one: the leap was in the ruler, not the model (see: Are LLM emergent abilities real or measurement artifacts?). Worse, two models with identical accuracy can be organized completely differently inside. One can carry all the linearly decodable features it needs while its internal structure is fractured, leaving it fragile to perturbation and distribution shift in ways no standard score reveals (see: Can models be smart without organized internal structure?). Same number, very different thing learned.

This is also why training can capture the *form* of a skill without the substance. Imitating ChatGPT reliably fools human evaluators by reproducing a confident, fluent style while closing none of the actual factuality gap (see: Can imitating ChatGPT fool evaluators into thinking models improved?). Instruction tuning turns out to teach the output-space distribution more than task understanding: models trained on semantically empty or deliberately wrong instructions score about the same as those trained on correct ones (see: Does instruction tuning teach task understanding or output format?). And chain-of-thought exemplars that are logically *invalid* perform nearly as well as valid ones, so the model learns the shape of reasoning, not inference itself (see: Does logical validity actually drive chain-of-thought gains?). In each case the expressive capacity to do the real thing exists; what got learned was the cheaper surface pattern.

The gap isn't just an accounting curiosity; ignoring it actively damages models. Train on problems that are too hard and the model learns degenerate shortcuts that then contaminate capabilities it already had, because rare accidental successes get reinforced as if they were sound reasoning (see: Do overly hard RLVR samples actually harm model capabilities?). Pure self-improvement stalls for the same family of reasons, including the generation-verification gap and reward hacking, and only works when it smuggles in an external anchor (see: Can models reliably improve themselves without external feedback?). Underneath all of it sits the harder claim that LLMs track statistical regularities with high fidelity yet have measurable, structurally specific epistemic failures: the gap between pattern-tracking and genuine knowledge isn't a tuning artifact you can train away (see: What do language models actually know?).

So here's the thing you might not have known you wanted to know: the most useful research direction may not be making models *more* expressive, since they often already represent more than they show, but getting better at telling elicitation apart from acquisition, and form from substance. A promising practical angle is composing latent skills at inference time rather than baking them in, as with singular-value expert vectors that mix dynamically without retraining (see: Can models dynamically activate expert skills at inference time?). If capability is mostly already present, the leverage shifts from *building* it to *surfacing the right part of it on demand*, and to evaluations honest enough to tell whether you did.

Sources (11 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
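
A toy calculation can make the metric effect concrete. The sketch below is not taken from the cited work: it assumes a per-token accuracy that improves smoothly across model sizes, and an all-or-nothing metric that counts an answer as correct only when every one of `ANSWER_LENGTH` independent tokens is right. The accuracies, the answer length, and the function name `exact_match` are all invented for illustration.

```python
# Toy illustration: a smooth per-token metric versus an all-or-nothing metric.
# All numbers are invented for illustration; nothing here is measured.

ANSWER_LENGTH = 20  # hypothetical number of tokens that must all be correct


def exact_match(per_token_accuracy: float, length: int = ANSWER_LENGTH) -> float:
    """Probability that every one of `length` independent tokens is correct."""
    return per_token_accuracy ** length


# Hypothetical per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

# Pair each smooth per-token score with its all-or-nothing counterpart.
rows = [(p, exact_match(p)) for p in per_token]

if __name__ == "__main__":
    print("per-token  exact-match")
    for p, em in rows:
        print(f"{p:9.2f}  {em:11.4f}")
```

Under these assumptions the per-token column rises steadily, while the exact-match column stays close to zero for most of the range and then climbs steeply over the last few rows. Both columns describe the same hypothetical models; only the metric differs.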

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, scoring 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
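
The effect of group-relative normalization on a rare success can be sketched with a small calculation. The note gives no formula, so this sketch assumes the common form in which each rollout's reward is z-scored against the other rollouts for the same prompt; the use of the population standard deviation, the group size of 16, the reward values, and the function name `group_relative_advantages` are assumptions made for illustration.

```python
# Sketch of group-relative advantage normalization: z-score each reward
# against the other rollouts sampled for the same prompt.
# Assumed form and invented rewards; not taken from the cited note.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return (r - group mean) / (group std + eps) for each reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is another option
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical near-impossible prompt: 1 accidental success in 16 rollouts.
hard_group = [1.0] + [0.0] * 15
# Hypothetical moderate prompt: 8 successes in 16 rollouts.
moderate_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
moderate_adv = group_relative_advantages(moderate_group)

if __name__ == "__main__":
    print(f"lucky success on hard prompt: {hard_adv[0]:.3f}")
    print(f"success on moderate prompt:   {moderate_adv[0]:.3f}")
```

With these invented groups, the single lucky success receives an advantage several times larger than a success on the moderate prompt, because the group mean and spread are both small when almost every rollout fails. The normalization has no way to tell whether that one success came from sound reasoning or from a shortcut.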

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

What do language models actually know?

LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
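
The idea of composable singular-value expert vectors can be sketched in a few lines. The note says only that singular values are tuned and that the resulting vectors mix at inference; the elementwise scaling `s * z`, the linear mixing rule, the 2x2 matrices, and every name below (`adapted_weight`, `mix`, `z_math`, `z_code`) are assumptions of this sketch, not a reproduction of the method. To stay dependency-free, the frozen weight is given directly by its factors rather than computed with an SVD routine.

```python
# Sketch: scale a frozen weight's singular values with per-task "expert
# vectors" and mix those vectors at inference time.
# Pure-Python 2x2 example; all matrices, vectors, and weights are invented.
import math


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def diag(v):
    return [[v[i] if i == j else 0.0 for j in range(len(v))]
            for i in range(len(v))]


def adapted_weight(u, s, v, z):
    """Rebuild W' = U diag(s * z) V^T, leaving U and V untouched."""
    scaled = [si * zi for si, zi in zip(s, z)]
    return matmul(matmul(u, diag(scaled)), transpose(v))


def mix(expert_vectors, weights):
    """Weighted sum of expert vectors, one coefficient per expert."""
    return [sum(w * z[i] for w, z in zip(weights, expert_vectors))
            for i in range(len(expert_vectors[0]))]


# Frozen weight given by its factors: W = U diag(S) V^T.
U, S, V = rotation(0.3), [2.0, 0.5], rotation(-1.1)

z_math = [1.5, 1.0]  # hypothetical expert: amplifies the first component
z_code = [1.0, 0.2]  # hypothetical expert: suppresses the second component

z_mixed = mix([z_math, z_code], weights=[0.5, 0.5])
W_mixed = adapted_weight(U, S, V, z_mixed)
```

In this sketch each expert is a vector with one entry per singular value, so it is far smaller than the matrix it adapts. Because the rebuilt matrix is linear in `z`, mixing two expert vectors gives the same matrix as mixing the two adapted weights with the same coefficients, which is one way to read the note's claim that the vectors compose.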

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether the expressiveness–capability gap still structures LLM behavior. This library (spanning 2023–2026) made claims about latent reasoning, elicitation vs. acquisition, and form vs. substance. Are they holding up, or have new models, training methods, or evaluations moved the regime?

What a curated library found — and when (dated claims, not current truth):

• Base models possess latent reasoning capability; RL/steering elicits rather than builds it (~2024–2025)
• Emergent ability jumps dissolve into smooth curves under continuous metrics; the leap is often in the measurement, not the model (2023)
• Identical accuracy can mask radically different internal organization — one model fragile, one robust — yet no standard benchmark reveals it (2024)
• Instruction tuning teaches output-format distribution and style imitation more than task understanding; invalid chain-of-thought chains perform nearly as well as valid ones (2023)
• Pure self-improvement stalls unless externally anchored; training on overly-hard samples induces degenerate shortcuts that corrupt adjacent capabilities (~2024–2026)

Anchor papers (verify; mind their dates):
• arXiv:2304.15004 (2023) — Emergent Abilities of LLMs: Mirage?
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2412.02674 (2024) — Self-Improvement Capabilities of LLMs
• arXiv:2605.28388 (2026) — Sample Difficulty in RLVR

Your task:

(1) **Re-test each constraint.** For every finding above, ask: have newer frontier models (o1, r1, Gemini 3, or successors), new training paradigms (multi-epoch RL, on-policy distillation, structured sampling), tooling (SAEs, mechanistic probes, sparse autoencoders for control), or orchestration (retrieval-augmented reasoning, long-context memory, multi-agent composition) since relaxed or overturned it? Separate the durable question (what capability and learning *fundamentally are* in transformers) from the perishable claim (what a 2024-era model couldn't do). Cite concretely what moved it.

(2) **Surface the strongest contradicting or superseding work.** In the last ~6 months, what papers challenge the expressiveness–capability gap as *the* organizing principle? Look for claims that models genuinely *learn* new capabilities (not just elicit), or that the form/substance boundary is blurrier than the library suggests.

(3) **Propose two research questions that assume the regime may have shifted.** If latent capability is now reliably surfaceable on-demand, or if hard-sample training no longer corrupts, what becomes the next bottleneck?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
