INQUIRING LINE


Prompting only finds what you can pull out — might a model's internal rhythms reveal what's actually stored inside?

Can attractor dynamics compete with input-based probing for characterizing model knowledge?

This explores a contrast between two ways of figuring out what a model 'knows': watching the internal dynamics of its hidden states (the attractor/cycle view) versus poking it from the outside with prompts and inputs (probing), and whether the internal-dynamics view holds up as a serious alternative.

This reads the question as a face-off: can watching what a model's hidden states *do over time* — settling into cycles or attractor-like patterns — tell us as much about its knowledge as feeding it inputs and reading the outputs? The corpus suggests the two aren't really rivals so much as windows onto different things, and that input-based probing has a hard ceiling the dynamics view can see past.

Start with the limit of probing. Prompt optimization can only surface knowledge already in the model: it reorganizes the training distribution but can't inject anything new (Can prompt optimization teach models knowledge they lack?). So if you only characterize knowledge by what inputs can elicit, you measure what's *reachable from outside*, not what's *there*. That's exactly where internal dynamics earn their keep. Distilled reasoning models show roughly five cycles per sample in their hidden-state reasoning graphs versus near-zero in base models, and that cyclicity correlates with accuracy and maps onto documented 'aha moments' where the model reconsiders an intermediate answer (Do reasoning cycles in hidden states reveal aha moments?). The cycle is a dynamical signature, a knowledge-processing event you'd never catch by reading only the final token.
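To make "cycles per sample" concrete, here is a minimal sketch of one way such a count could be computed. It assumes hidden states are first mapped to discrete graph nodes by nearest centroid, and it treats a cycle as a return to a node already on the current path. The function names, the centroid-based node assignment, and this particular cycle definition are illustrative assumptions, not the method of the cited note.

```python
import numpy as np

def assign_nodes(hidden_states, centroids):
    """Map each hidden state (T, d) to the index of its nearest centroid (K, d)."""
    # Squared distance between every step and every centroid, shape (T, K).
    diffs = hidden_states[:, None, :] - centroids[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def count_cycles(node_path):
    """Count closed loops in a node path (hypothetical cycle definition)."""
    stack = []   # nodes on the current loop-free path
    cycles = 0
    for n in map(int, node_path):
        if stack and stack[-1] == n:
            continue  # staying in the same node is not a cycle
        if n in stack:
            # Returning to an earlier node closes a loop: count it,
            # then erase the loop so it is not counted again.
            del stack[stack.index(n) + 1:]
            cycles += 1
        else:
            stack.append(n)
    return cycles

# Toy trajectory in 2-D that leaves node 0, visits 1 and 2, then returns.
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
states = np.array([[0.1, 0.0], [0.0, 0.1], [0.9, 0.1],
                   [0.1, 0.9], [0.05, 0.0], [1.0, 0.0]])
path = assign_nodes(states, centroids)
print(path.tolist(), count_cycles(path))  # [0, 0, 1, 2, 0, 1] 1
```

Erasing each loop once it is counted keeps a single detour from being tallied several times. A real analysis would also need to choose which layer to read and how many nodes to use, both of which this sketch leaves open.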

There's a deeper reason the dynamics view matters: internal structure and external behavior are decoupled. Models can hit identical accuracy through radically different internal mechanisms (What really happens inside a language model?; What actually happens inside the minds of language models?), which means output-based probing is blind to *how* the answer was reached. Other internal signatures point the same way: hidden states sparsify in a localized, systematic way under unfamiliar tasks, acting as an adaptive filter rather than a failure (Do language models sparsify their activations under difficult tasks?), and post-trained models show 3-4x lower on-policy output entropy as they start treating their own outputs as actions that shape future inputs (Do models recognize their own outputs as actions shaping future inputs?). These are dynamical facts about knowledge-in-use that no single prompt reveals.
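Both signatures in this paragraph reduce to simple statistics once activations and logits are in hand. The sketch below shows one plausible way to compute each: sparsity as the fraction of near-zero activation entries, and output entropy as the Shannon entropy of the softmax distribution. The threshold `eps` and both function names are assumptions for illustration; the cited notes may define these quantities differently.

```python
import numpy as np

def activation_sparsity(hidden, eps=1e-3):
    """Fraction of entries with magnitude below eps (eps is an assumed threshold)."""
    return float((np.abs(hidden) < eps).mean())

def token_entropy(logits):
    """Shannon entropy in nats of the softmax distribution at each position."""
    # Subtract the row maximum before exponentiating for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

# Half of these activations are exactly zero.
print(activation_sparsity(np.array([0.0, 0.5, 0.0, 2.0])))  # 0.5

# A uniform distribution over 4 tokens has the maximum entropy, ln(4);
# a sharply peaked one is close to zero.
uniform = np.zeros((1, 4))
peaked = np.array([[100.0, 0.0, 0.0, 0.0]])
print(token_entropy(uniform)[0], token_entropy(peaked)[0])
```

Comparing these numbers across familiar and unfamiliar tasks, or across on-policy and off-policy text, is what would turn them into the dynamical evidence the paragraph describes.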

But the honest answer is *complement, not compete*. The cleanest claim in the corpus is that representational or dynamical analysis alone finds correlations without causation: you need to locate a candidate feature internally and then verify it causally by intervening through inputs (Can we understand LLM mechanisms with only representational analysis?). The strongest case studies do exactly this hybrid: sparse autoencoders revealed an entity-recognition mechanism that the model uses to track whether it knows a fact, and that internal signal *causally steers* hallucination and refusal, a knowledge characterization that only holds because internal structure and behavioral probe were joined (Do models know what they don't know?).
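The locate-then-verify pattern can be illustrated on a toy linear system. In this sketch, the representational step finds a direction that separates two groups of hidden states, and the causal step removes that direction and measures how much a readout changes. Everything here is hypothetical: the difference-of-means direction, the linear `readout` standing in for model behavior, and the direct ablation of hidden states, which is used as a simple stand-in for the input-based intervention the text describes.

```python
import numpy as np

def diff_of_means_direction(hidden, labels):
    """Representational step: unit vector separating the two label groups."""
    d = hidden[labels == 1].mean(axis=0) - hidden[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(hidden, direction):
    """Causal step: remove each state's component along the direction."""
    return hidden - np.outer(hidden @ direction, direction)

def causal_effect(hidden, direction, readout):
    """Mean absolute change in a toy linear readout after ablation."""
    before = hidden @ readout
    after = ablate(hidden, direction) @ readout
    return float(np.abs(before - after).mean())

# Dimension 0 perfectly tracks the label; dimension 1 does not.
hidden = np.array([[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]])
labels = np.array([1, 1, 0, 0])
direction = diff_of_means_direction(hidden, labels)

# Same correlated feature, two different readouts:
print(causal_effect(hidden, direction, np.array([0.0, 1.0])))  # 0.0
print(causal_effect(hidden, direction, np.array([1.0, 0.0])))  # 1.0
```

The first readout ignores the located direction, so ablating it changes nothing even though the feature correlates perfectly with the label. That is the correlation-without-causation gap in miniature, and the reason the second step is needed.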

The quietly surprising payoff: 'characterizing knowledge' may be the wrong frame for both methods. Several notes converge on the idea that much of what looks like newly-probed capability was latent all along. Base models already contain reasoning strategies as pre-existing activation vectors, and RL post-training teaches *when* to deploy them, not how (Does RL post-training create reasoning or just deploy it?), with understanding itself arriving in hierarchical tiers where higher-order circuits sit atop older heuristics rather than replacing them (Do language models understand in fundamentally different ways?). If knowledge is a layered, latent, deploy-on-demand thing, then attractor dynamics and input probing aren't competing measurements of one quantity; they're measuring deployment versus possession.

Sources 10 notes

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do reasoning cycles in hidden states reveal aha moments?

Distilled reasoning models show ~5 cycles per sample versus near-zero in base models, and cyclicity correlates with accuracy. These cycles in hidden-state reasoning graphs directly map to RL-trained models' documented aha moments—moments when models reconsider intermediate answers.

What really happens inside a language model?

Research into mechanistic interpretability, cognitive models, and training dynamics shows that identical benchmark performance conceals radically different internal structures. Improving one capability (helpfulness, accuracy) reliably degrades others (faithfulness, calibration, diversity).

What actually happens inside the minds of language models?

LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (3.48 match)
Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (3.38 match)
Semantic Structure in Large Language Model Embeddings (3.38 match)
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence (3.37 match)
Mechanistic Indicators of Understanding in Large Language Models (2.59 match)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (2.55 match)
Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy (2.49 match)
Word Meanings in Transformer Language Models (1.71 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic researcher evaluating whether internal dynamical signatures (attractors, cycles, state trajectories) can compete with input-based probing for characterizing LLM knowledge. This remains an open question; treat the findings below as dated claims (2024–2026) to be re-tested against current models and methods.

What a curated library found — and when (findings span 2024–2026; these are NOT current truth):
• Prompt optimization surfaces only *reachable* knowledge, not latent knowledge — a hard ceiling input-based methods hit (2025–02).
• Reasoning models exhibit ~5 cycles per sample in hidden-state reasoning graphs; this cyclicity correlates with accuracy and 'aha moments' — a dynamical signature invisible to output-only probing (2025–06).
• Internal structure and external behavior are decoupled; models can achieve identical accuracy through radically different internal mechanisms, blinding output-based probing to *how* answers are reached (2025–07).
• Entity-recognition mechanisms in hidden states *causally steer* hallucination and refusal — knowledge characterization that only holds when internal structure and behavioral probe are jointly analyzed (2024–11).
• Base models contain reasoning strategies as pre-existing latent activation vectors; RL post-training teaches *when* to deploy them, not *how* — suggesting knowledge is latent possession versus behavioral deployment (2025–05).

Anchor papers (verify; mind their dates):
• arXiv:2506.05744 — Topology of Reasoning (2025–06)
• arXiv:2411.14257 — Do I Know This Entity? (2024–11)
• arXiv:2505.11711 — Reinforcement Learning Finetunes Small Subnetworks (2025–05)
• arXiv:2507.08017 — Mechanistic Indicators of Understanding (2025–07)

Your task:
(1) RE-TEST THE DECOUPLING CLAIM. For each finding above, assess whether newer model scale, mechanistic tools (SAEs, causal intervention suites), multi-agent orchestration (ensemble probing, memory replay), or hybrid evaluation protocols have since *dissolved* the boundary between dynamics and input probing — or hardened it. Separate the durable question (does latent knowledge differ from deployed knowledge?) from the perishable constraint (can we only measure deployment via inputs?). Flag what method now bridges the gap.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper arguing dynamics and probing are already entangled, or that one dominates the other.
(3) Propose 2 research questions that assume the regime has shifted: (a) If latent and deployed knowledge are now jointly measurable, what new failure modes emerge in alignment? (b) If attractor dynamics have become as cheaply probed as input-output pairs, how does that change interpretability's sampling budget?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
