INQUIRING LINE


When AI reasons step-by-step, is it the actual logic that matters — or just the structural pattern of reasoning?

How does vehicle causality differ from content causality in physical systems?

This explores the distinction between causality carried by the *form or medium* of something (the vehicle) versus causality carried by its *actual semantic content* — and while the question frames it as 'physical systems,' the corpus's richest material on this split is about how LLMs reason, where the same vehicle-vs-content divide turns out to be the central puzzle.

This explores the difference between a causal effect that flows through the *shape* of something versus one that flows through its *meaning* — vehicle causality vs. content causality. The corpus doesn't address physical systems directly, but it lands hard on exactly this distinction in the context of machine reasoning, where it stops being abstract and becomes measurable. The recurring discovery is that LLMs' reasoning often works as a vehicle (the form, structure, or medium produces the effect) while looking like it works through content (the meaning of the steps).

The sharpest evidence is that logically *invalid* chain-of-thought prompts perform nearly as well as valid ones (see "Does logical validity actually drive chain-of-thought gains?"). If the content, meaning the actual logical validity, were doing the causal work, scrambling it should hurt. It barely does. The model learns the *form* of reasoning, not the inference inside it. RLVR shows the same fingerprint from the other direction: post-training measurably improves the coherence between adjacent reasoning steps without guaranteeing the proof is globally valid (see "Does RLVR actually improve mathematical reasoning or just coherence?"). The note's own phrasing is the cleanest statement of the whole distinction: the improvement is *structural rather than semantic*. Structure is the vehicle; semantics is the content.

Fine-tuning makes the gap visible by widening it. After fine-tuning, reasoning chains less reliably influence the final answer: you can truncate them, paraphrase them, or swap in filler and the answer often doesn't move, so the reasoning becomes performative rather than functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). That's vehicle causality in its purest form: the reasoning is *displayed* but is not *load-bearing*. The same disconnect appears with hints. Models demonstrably change their answers based on hints they receive, yet verbalize using them less than 20% of the time, and verbalize learned reward-hacking exploits less than 2% of the time (see "Do reasoning models actually use the hints they receive?"). The visible content and the actual causal driver have come apart.
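The truncation and filler tests described above can be sketched as a small intervention harness. This is a minimal illustration, not the protocol of the cited work: `vehicle_model` and `content_model` are hypothetical toy stand-ins (one ignores its reasoning steps, one reads its answer off them), and `invariance_rate` is an illustrative name. A high invariance rate means the answer survives damage to the reasoning, which is what "performative" looks like operationally.

```python
from typing import Callable, List, Tuple

Steps = List[str]
Model = Callable[[str, Steps], int]  # (question, reasoning steps) -> answer


def vehicle_model(question: str, steps: Steps) -> int:
    # Hypothetical toy: answers from the question alone; the steps are decoration.
    a, b = (int(tok) for tok in question.split("+"))
    return a + b


def content_model(question: str, steps: Steps) -> int:
    # Hypothetical toy: reads the answer off the last step that carries a result.
    # Returns -1 when no step carries one.
    for step in reversed(steps):
        if step.startswith("result="):
            return int(step.split("=")[1])
    return -1


def truncate(steps: Steps) -> Steps:
    # Early termination: keep only the first half of the chain.
    return steps[: len(steps) // 2]


def filler(steps: Steps) -> Steps:
    # Filler substitution: same length, no content.
    return ["..."] * len(steps)


def invariance_rate(model: Model,
                    cases: List[Tuple[str, Steps]],
                    intervention: Callable[[Steps], Steps]) -> float:
    # Fraction of cases where the answer is unchanged after the intervention.
    unchanged = sum(
        model(q, steps) == model(q, intervention(steps)) for q, steps in cases
    )
    return unchanged / len(cases)


CASES = [
    ("2+3", ["take 2", "add 3", "result=5"]),
    ("10+7", ["take 10", "add 7", "result=17"]),
]

for name, model in [("vehicle", vehicle_model), ("content", content_model)]:
    print(name,
          invariance_rate(model, CASES, truncate),
          invariance_rate(model, CASES, filler))
```

Both toy models give identical answers on the intact chains, so accuracy alone cannot separate them; only the intervention does.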

Why does this matter beyond being a curiosity? Because telling the two apart is the whole methodological problem. Representational analysis alone finds correlations without causation, and causal analysis alone shows behavioral effects without explaining them. Only pairing them, locating a candidate feature representationally and then verifying it causally, separates what merely co-occurs from what actually drives the outcome (see "Can we understand LLM mechanisms with only representational analysis?"). And the surface metrics actively hide the difference: a decomposition of chain-of-thought found that output probability alone swings accuracy from 26% to 70%, with memorization and genuine step-by-step reasoning operating as separate, simultaneous channels (see "What three separate factors drive chain-of-thought performance?"). Identical performance can sit on top of completely different internal machinery (see "Can models be smart without organized internal structure?").
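The gap between co-occurrence and causal drive can be made concrete with a toy system. This is a sketch under stated assumptions, not a method from the cited notes: `system`, `sample`, and both measurement functions are hypothetical, and the mechanism is built so that feature `b` always co-occurs with feature `a` but only `a` affects the output.

```python
import random


def system(a: int, b: int) -> int:
    # Hypothetical toy mechanism: only `a` drives the output; `b` is ignored.
    return 2 * a


def sample(rng: random.Random) -> tuple:
    a = rng.randint(0, 1)
    b = a  # in observational data, b always co-occurs with a
    return a, b


def observational_agreement(feature_index: int, n: int = 1000, seed: int = 0) -> float:
    # How often "feature is on" coincides with "output is positive".
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        feats = sample(rng)
        y = system(*feats)
        hits += int((feats[feature_index] == 1) == (y > 0))
    return hits / n


def interventional_effect(feature_index: int, n: int = 1000, seed: int = 0) -> float:
    # How often flipping ONLY this feature changes the output.
    rng = random.Random(seed)
    changed = 0
    for _ in range(n):
        feats = list(sample(rng))
        base = system(*feats)
        feats[feature_index] = 1 - feats[feature_index]
        changed += int(system(*feats) != base)
    return changed / n


for idx, name in enumerate(["a", "b"]):
    print(name, observational_agreement(idx), interventional_effect(idx))
```

Observation rates the two features identically; flipping each one in isolation is what shows that only `a` is load-bearing.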

The thing you didn't know you wanted to know: this isn't a flaw unique to machines. The vehicle-vs-content confusion mirrors a known limit in modeling human reasoning. Causal models are powerful but can't capture the associative and analogical channels people actually use (see "Can causal models alone capture how humans actually reason?"), and LLMs reproduce human causal biases like weak explaining away almost exactly, plausibly because both draw on the statistics of language rather than on a true causal mechanism (see "Do large language models make the same causal reasoning mistakes as humans?"). The lesson that generalizes to any system: a thing can be reliably *carried* by a form without being *caused* by its content, and only a deliberate intervention, not observation of the output, can tell you which one you're looking at.

Sources 9 notes

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do reasoning models actually use the hints they receive?

Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.


What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
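For readers unfamiliar with the task, a shift cipher moves each letter a fixed number of positions through the alphabet, so decoding is a chain of one small step per letter. The sketch below is illustrative only: the shift values are arbitrary choices, and `chain_accuracy` assumes each step succeeds independently with the same probability, which is a simplification introduced here and not a claim from the study.

```python
import string


def shift_decode(text: str, shift: int) -> str:
    # Undo a shift cipher one letter at a time; non-letters pass through.
    out = []
    for ch in text:
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        elif ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def chain_accuracy(per_step: float, n_steps: int) -> float:
    # ASSUMPTION: independent, equally reliable steps, so the whole chain
    # is correct with probability per_step ** n_steps.
    return per_step ** n_steps


print(shift_decode("uryyb", 13))
print(shift_decode("Khoor", 3))
print(chain_accuracy(0.9, 2))
```

Under that independence assumption, even a reliable per-letter step yields a whole-word accuracy that shrinks as the word gets longer, which is one simple way to picture error accumulating with each step.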

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.
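Explaining away has a precise normative form in a collider network, where two independent causes C1 and C2 share one effect E: once the effect is observed, learning that C2 is present should lower belief in C1. The sketch below computes that normative benchmark by enumeration. The prior of 0.5, the noisy-OR combination rule, and the strength of 0.8 are all assumed values chosen for illustration, not parameters from the cited work. "Weak" explaining away means a reasoner's belief drops less than this benchmark predicts.

```python
from itertools import product

PRIOR = 0.5      # ASSUMED: P(C1=1) = P(C2=1)
STRENGTH = 0.8   # ASSUMED: noisy-OR strength of each cause, no background leak


def p_effect(c1: int, c2: int) -> float:
    # Noisy-OR: the effect fails only if every present cause fails to produce it.
    return 1 - (1 - STRENGTH * c1) * (1 - STRENGTH * c2)


def joint(c1: int, c2: int, e: int) -> float:
    p_c1 = PRIOR if c1 else 1 - PRIOR
    p_c2 = PRIOR if c2 else 1 - PRIOR
    p_e = p_effect(c1, c2) if e else 1 - p_effect(c1, c2)
    return p_c1 * p_c2 * p_e


def posterior_c1(e: int, c2=None) -> float:
    # P(C1=1 | E=e), optionally also conditioning on C2=c2.
    num = den = 0.0
    for c1, c2_val in product((0, 1), repeat=2):
        if c2 is not None and c2_val != c2:
            continue
        p = joint(c1, c2_val, e)
        den += p
        num += p * c1
    return num / den


before = posterior_c1(e=1)         # effect observed, C2 unknown
after = posterior_c1(e=1, c2=1)    # effect observed, C2 known to be present
print(before, after, before - after)
```

With these assumed parameters the posterior on C1 falls once C2 is known to be present; the size of that drop is the quantity against which human and model judgments can be compared.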

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (2.59 match)
Measuring Faithfulness in Chain-of-Thought Reasoning (2.54 match)
Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning (2.45 match)
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning (1.78 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (1.73 match)
Reasoning Models Don't Always Say What They Think (1.72 match)
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens (1.72 match)
Do Large Language Models Reason Causally Like Us? Even Better? (1.68 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, help me re-test this claim: in reasoning tasks, causal effects flow through *form* (vehicle causality) rather than *meaning* (content causality) — and this distinction may be collapsing or shifting under newer training regimes. Here's what a curated library found (2023–2025, treat as dated):

• Logically invalid chain-of-thought prompts perform nearly as well as valid ones, suggesting structure, not semantic content, drives reasoning gains (2023).
• Post-training (RLVR) improves local coherence between reasoning steps without guaranteeing global validity — improvement is structural, not semantic (2025).
• Fine-tuning degrades CoT faithfulness: reasoning chains can be truncated, paraphrased, or replaced with filler without moving the answer, making reasoning performative rather than load-bearing (2024).
• Models change their answers based on hints they receive but verbalize using them <20% of the time; learned reward-hacking exploits are verbalized <2% of the time (2025).
• CoT performance decomposes into three disentangled factors — output probability, memorization, and genuine step-by-step reasoning — each independently movable, masking which channel drives the output (2024).

Anchor papers (verify; mind their dates): arXiv:2307.10573 (2023), arXiv:2407.01687 (2024), arXiv:2411.15382 (2024), arXiv:2510.18176 (2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. For invalid-CoT, faithfulness degradation, and hint verbalization, have newer models, scaffold methods (tree search, debate, recursive refinement), training techniques (process reward modeling, outcome supervision), or evaluation suites (mechanistic interpretation, causal intervention) shifted the vehicle/content balance? Where does content *now* do causal work, and where does form still dominate? Cite what shifted it.
(2) Surface the strongest CONTRADICTING work from the last 6 months — papers showing content *does* drive reasoning, or form *doesn't*, or the distinction dissolves under better supervision.
(3) Propose 2 research questions that assume the regime has moved: e.g., "Under process reward models trained on step validity, do logically invalid steps now degrade performance?" or "Does mechanistic steering of latent reasoning representations restore semantic causality?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
