INQUIRING LINE


Two AI models can score identically on every test while one has clean internal wiring and the other is a tangled mess.

What is the difference between changing model outputs versus changing internal representations?

This explores a distinction the corpus keeps circling: whether a training method actually rewires how a model represents and processes information, or just reshapes the surface form of what it says — and why two models that behave identically can be built completely differently inside.

Changing what a model *says* and changing how it *works inside* come apart far more than you'd expect. The cleanest version of the insight is that identical outputs can hide radically different internal machinery. Two networks can score the same on a benchmark while one has clean, modular internal structure and the other has what researchers call 'fractured, entangled representations': tangled wiring that happens to produce right answers but breaks the moment you ask it to transfer to a new context or recombine ideas creatively [Can identical outputs hide broken internal representations?]. Performance and internal structure are, in a real sense, decoupled [What actually happens inside the minds of language models?; What really happens inside a language model?].
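To make the decoupling concrete, here is a toy sketch, not the method used in the cited work. Two hand-built ReLU networks (the names `CLEAN` and `TANGLED` are hypothetical) both compute |x| on a small test set. The tangled one carries an extra pair of large weights that cancel each other, so its outputs match exactly, yet nudging a single weight shifts its answers far more.

```python
import math

def relu(v):
    return v if v > 0.0 else 0.0

def forward(net, x):
    # A net is a list of hidden units, each an (input_weight, output_weight) pair.
    return sum(w_out * relu(w_in * x) for w_in, w_out in net)

# Clean network: two units, one per sign of x, giving |x|.
CLEAN = [(1.0, 1.0), (-1.0, 1.0)]
# Tangled network: same two units plus a large pair that cancels out.
TANGLED = CLEAN + [(1.0, 50.0), (1.0, -50.0)]

# Illustrative test inputs (chosen to be exact in binary floating point).
INPUTS = [-2.0, -0.5, 0.0, 0.5, 1.0, 3.0]

def max_output_shift(net, eps=0.01):
    """Largest output change on INPUTS when one input weight is nudged by eps."""
    base = [forward(net, x) for x in INPUTS]
    worst = 0.0
    for i, (w_in, w_out) in enumerate(net):
        nudged = net[:i] + [(w_in + eps, w_out)] + net[i + 1:]
        shift = max(abs(forward(nudged, x) - b) for x, b in zip(INPUTS, base))
        worst = max(worst, shift)
    return worst

same_outputs = all(
    math.isclose(forward(CLEAN, x), forward(TANGLED, x), abs_tol=1e-12)
    for x in INPUTS
)
print("identical on the test set:", same_outputs)
print("clean sensitivity:  ", max_output_shift(CLEAN))
print("tangled sensitivity:", max_output_shift(TANGLED))
```

The test set alone cannot tell the two apart; only the perturbation probe does, which is the sense in which a score says nothing about the wiring.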

Once you see that decoupling, a whole cluster of training-method findings snaps into focus, and most of them turn out to be changing outputs, not understanding. Instruction tuning, for instance, mostly teaches a model the *shape* of acceptable answers rather than the task itself: models trained on deliberately wrong or semantically empty instructions perform about as well as ones trained on correct ones, because what actually transfers is knowledge of the output space [Does instruction tuning teach task understanding or output format?]. The same pattern shows up in reasoning: a tiny 1.5B-parameter model with lightweight LoRA tuning matches much larger RL-trained models, suggesting the RL was teaching output *formatting* and organization, not new factual knowledge [Can small models reason well by just learning output format?]. RL training even tends to collapse a model down to a single dominant answer format it already learned in pretraining, amplifying one style while suppressing alternatives [Does RL training collapse format diversity in pretrained models?].
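One way to put a number on format collapse is the Shannon entropy of the answer formats a model produces. The sketch below is an assumption about how one might measure it, not the cited experiments' procedure; the format labels and counts are made up purely for illustration.

```python
import math
from collections import Counter

def format_entropy(labels):
    """Shannon entropy (in bits) of the distribution over format labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Made-up samples: the format of each of 16 answers, before and after RL.
before_rl = ["prose"] * 4 + ["code"] * 4 + ["latex"] * 4 + ["bullets"] * 4
after_rl = ["code"] * 16

print("entropy before:", format_entropy(before_rl))  # four equally likely formats
print("entropy after: ", format_entropy(after_rl))   # one format only
```

Four equally likely formats give 2 bits; a single surviving format gives 0. A drop like that would signal collapse even if accuracy on the benchmark went up.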

But 'just changing outputs' doesn't mean 'changing nothing inside'; it means the internal change is narrow and specific. When researchers looked at what RL actually edits, they found it sparsely updates only 5–30% of parameters and works mainly by *suppressing* wrong trajectories rather than building new capability [What actually changes inside a model during RL training?]. So the internal footprint is real but surgical: RL nudges which paths get used rather than laying down new knowledge.
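A figure like '5–30% of parameters' comes from comparing checkpoints before and after training. Here is a minimal sketch of that comparison, assuming the weights have been flattened into plain lists; the function name `update_sparsity`, the tolerance, and the ten example weights are all hypothetical.

```python
def update_sparsity(before, after, tol=1e-8):
    """Fraction of parameters whose value moved by more than tol."""
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)

# Hypothetical flattened weights before training.
before = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]

# Pretend training touched only two of the ten weights.
after = list(before)
after[2] += 0.05
after[7] -= 0.03

fraction = update_sparsity(before, after)
print(f"{fraction:.0%} of parameters changed")
```

Note what this measures: how many weights moved, not how much behavior changed. A small fraction of edited weights can still redirect which trajectories the model follows.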

The more interesting frontier is methods that genuinely touch internal representations, and the corpus shows these leave fingerprints you can find. Models develop real internal mechanisms for tracking whether they actually know a fact about an entity, and you can causally steer hallucination or refusal by manipulating them [Do models know what they don't know?]. Even more striking, the *type* of training matters for what internal capacities emerge: DPO (a preference-based method) builds a two-stage internal circuit that can detect when its own activations are being tampered with, a circuit that plain supervised fine-tuning doesn't produce and that safety training can actively suppress [How do language models detect injected steering vectors internally?]. This connects to the limits of self-report: a model's spoken claims about itself mostly echo training data, with only thin genuine introspection where a causal chain links an internal state to the report [Can language models actually introspect about their own states?].
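'Tampering with activations' usually means adding a steering vector to a hidden state. The toy below shows only the arithmetic of that idea, under loud assumptions: the four-number hidden state and the direction are invented, and the projection check is a stand-in for illustration, not the learned detection circuit the cited note describes.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def steer(hidden, direction, strength):
    """Add strength times the unit-length direction to a hidden state."""
    d = unit(direction)
    return [h + strength * x for h, x in zip(hidden, d)]

def projection(hidden, direction):
    """Component of the hidden state along the direction."""
    return dot(hidden, unit(direction))

hidden = [0.2, -0.1, 0.4, 0.0]     # hypothetical activation
direction = [3.0, 4.0, 0.0, 0.0]   # hypothetical steering direction

steered = steer(hidden, direction, strength=3.0)
shift = projection(steered, direction) - projection(hidden, direction)
print("shift along the steering direction:", shift)
```

The shift along the direction equals the injected strength, so anything that can read that component can in principle notice the injection. Whether a model learns to read it is exactly what differs between training methods.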

The payoff for a curious reader: 'does it work?' and 'did we change how it thinks?' are different questions, and most popular tuning methods quietly answer only the first. The deeper line of research is the part actually trying to engage the internal representations rather than just dress up the outputs: verifier-free RL that taps the model's own internal belief-shifts as a training signal [Can language models replace reward models with internal signals?], or RAG framed as a step-by-step decision about when to reach for internal versus external knowledge [When should language models retrieve external knowledge versus use internal knowledge?]. The thing worth taking away: a benchmark score can't tell you which one you got.

Sources (12 notes)

Can identical outputs hide broken internal representations?

Networks trained with SGD reproduce outputs perfectly while having radically different internal structure than evolved networks, with weight perturbations revealing fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

What actually happens inside the minds of language models?

LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.

What really happens inside a language model?

Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, with instruction-tuned models reaching 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.


Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (3.28 match)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (2.55 match)
Semantic Structure in Large Language Model Embeddings (2.54 match)
Are Emergent Abilities in Large Language Models just In-Context Learning? (2.50 match)
Does It Make Sense to Speak of Introspection in Large Language Models? (1.73 match)
Mechanisms of Introspective Awareness (1.73 match)
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (1.71 match)
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning (1.69 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether LLM tuning methods change model outputs versus internal representations. The question remains open: do popular training techniques actually alter how models think, or just what they say?

What a curated library found — and when (findings span 2023–2026; treat as dated claims):
• Instruction tuning teaches output-format distribution, not task understanding; models trained on wrong instructions perform similarly (2023).
• RL post-training sparsely edits only 5–30% of parameters, suppressing wrong trajectories rather than building new capability; it amplifies a single format from the pretraining distribution (2025).
• A 1.5B LoRA-tuned model matches much larger RL-trained models on reasoning, suggesting RL taught formatting, not factual knowledge (2025).
• Models develop causal internal mechanisms for tracking entity knowledge and hallucination; DPO (not SFT) builds introspective circuits that detect activation tampering (2025–2026).
• Models' self-reports mostly echo training data; genuine introspection exists only where a causal chain links internal state to the report (2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2411.14257 (2024) — Do I Know This Entity? Knowledge Awareness and Hallucinations
• arXiv:2505.11581 (2025) — Fractured Entangled Representations
• arXiv:2603.21396 (2026) — Mechanisms of Introspective Awareness

Your task:
(1) RE-TEST EACH CONSTRAINT. For claims about instruction tuning teaching only format (2023) and RL editing sparse subnetworks (2025), ask: have newer model scales, mixture-of-experts architectures, or multi-task RL objectives since made these constraints stale? Separate the durable insight (outputs and internals are decoupled) from the perishable claim (tuning methods *only* change outputs). Cite what evidence would overturn each.
(2) Surface the strongest CONTRADICTING work from the last 6 months. Look for papers showing that even SFT builds introspective circuits, or that output changes *necessitate* internal rewiring, directly opposing the 2025–2026 synthesis.
(3) Propose 2 research questions that assume the regime shifted: (a) Do verifier-free RL and in-context learning tap the same internal belief-update mechanism? (b) Can you design a tuning method that *provably* changes internal representations without changing outputs?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
