INQUIRING LINE


Training AI to reason better might only require focusing on the 20% of its thinking where it genuinely faces a choice.

Can we improve reasoning by amplifying information at mutual information peaks?

This reads the question as: do reasoning gains come from concentrating learning on the rare high-information moments in a chain of thought — the forking points where the model's uncertainty is highest — rather than treating every token equally?

The corpus's answer is, surprisingly, yes, and the effect is large. Only about 20% of tokens in a reasoning trace are high-entropy 'forking points' where the model genuinely chooses between paths, and applying reinforcement learning updates exclusively to those tokens matches or even beats updating on all of them (see "Do high-entropy tokens drive reasoning model improvements?"). The minority carries the learning signal; the rest contributes little. That is the strongest direct support for the question's premise: the 'peaks' aren't a metaphor, they're a measurable, exploitable minority.
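
To make the selection step concrete, here is a minimal sketch of restricting a loss to the highest-entropy positions in a trace. It assumes a simple rule (keep the top 20% of positions ranked by Shannon entropy within one trace); the function names, the toy distributions, and the `keep_fraction` parameter are illustrative assumptions, not the procedure of the cited work.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(distributions, keep_fraction=0.2):
    """Mark the highest-entropy positions; True means 'include in the loss'."""
    entropies = [token_entropy(d) for d in distributions]
    n_keep = max(1, round(keep_fraction * len(entropies)))
    # Rank positions by entropy, highest first, and keep the top slice.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(entropies))]

def masked_mean_loss(per_token_losses, mask):
    """Average the per-token loss over the selected positions only."""
    selected = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(selected) / len(selected)

# Toy trace (hypothetical numbers): five positions over a 3-token vocabulary.
trace = [
    [0.98, 0.01, 0.01],  # near-deterministic continuation
    [0.34, 0.33, 0.33],  # genuine fork: the model is close to undecided
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.95, 0.03, 0.02],
]
mask = forking_mask(trace, keep_fraction=0.2)
loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0, 5.0], mask)
```

In this toy trace only the second position survives the mask, so the loss is computed from that single forking token. A real pipeline would more likely set the entropy threshold over a batch rather than per trace; that choice is left open here.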

But amplifying information isn't the same as amplifying confidence or length, and the corpus is sharp on this distinction. You can use the model's own confidence at the answer span as a reward to rank reasoning traces, which improves step-by-step reasoning while also reversing the calibration damage that human-feedback training tends to cause (see "Can model confidence work as a reward signal for reasoning?"). Confidence here works as a proxy for informativeness. The danger sign is when training optimizes the wrong signal: supervised fine-tuning raises benchmark accuracy while cutting 'Information Gain' by nearly 39%. The model learns to produce correct answers through post-hoc rationalization rather than genuine inferential steps, and standard metrics miss it entirely because they only check the final answer (see "Does supervised fine-tuning improve reasoning or just answers?"). So you can degrade the very thing the question wants to amplify while your scoreboard goes up.
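
The confidence-as-reward idea can be sketched as a ranking step that turns several sampled traces into synthetic preference pairs. The confidence definition below (geometric mean of the answer-span token probabilities), the dictionary layout, and the trace names are assumptions made for illustration; the source notes only say that answer-span confidence is used to rank traces.

```python
import math
from itertools import combinations

def answer_span_confidence(token_probs):
    """Geometric-mean probability of the answer-span tokens (assumed definition)."""
    return math.exp(sum(math.log(p) for p in token_probs) / len(token_probs))

def preference_pairs(traces):
    """Return (chosen, rejected) name pairs, with the more confident trace chosen."""
    scored = sorted(
        traces,
        key=lambda t: answer_span_confidence(t["answer_probs"]),
        reverse=True,
    )
    # Every earlier trace in the ranking is preferred over every later one.
    return [(a["name"], b["name"]) for a, b in combinations(scored, 2)]

# Hypothetical traces for one prompt, each with its answer-token probabilities.
traces = [
    {"name": "trace_a", "answer_probs": [0.9, 0.8]},
    {"name": "trace_b", "answer_probs": [0.5, 0.4]},
    {"name": "trace_c", "answer_probs": [0.99, 0.95]},
]
pairs = preference_pairs(traces)
```

The resulting pairs could feed any preference-based trainer. Note that this ranks by confidence alone, which is exactly the kind of signal the paragraph above warns can drift away from information gain if it is not checked.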

There's a deeper reframing worth knowing: the information you'd want to amplify may already be in the model, waiting to be elicited rather than created. Five independent methods all unlock reasoning that's latently present in base-model activations: post-training selects reasoning, it doesn't build it (see "Do base models already contain hidden reasoning ability?"). In the same spirit, verbose and concise reasoning occupy distinct linear directions in activation space that you can steer with a single extracted vector and no retraining (see "Can we steer reasoning toward brevity without retraining?"). That suggests 'amplifying at the peaks' might be done at inference time by nudging activations, not just by reweighting the training loss.
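
As a rough illustration of steering with a single vector, the sketch below builds a direction as the difference between mean activations of concise and verbose examples and adds a scaled copy to a hidden state. Difference-of-means is a common way to extract such a direction, but it is an assumption here, as are the function names, the two-dimensional toy activations, and the `strength` parameter.

```python
def mean_vector(vectors):
    """Element-wise mean of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(concise_acts, verbose_acts):
    """Difference of means: points from the verbose region toward the concise one."""
    concise_mean = mean_vector(concise_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]

def steer(hidden, vector, strength=1.0):
    """Shift one hidden state along the steering direction at inference time."""
    return [h + strength * s for h, s in zip(hidden, vector)]

# Hypothetical paired activations (real ones would have thousands of dimensions).
concise = [[1.0, 0.0], [3.0, 2.0]]
verbose = [[0.0, 1.0], [2.0, 5.0]]

vec = steering_vector(concise, verbose)
steered = steer([0.5, 0.5], vec, strength=0.5)
```

No weights change in this sketch; the only intervention is the addition applied to the hidden state, which is what makes the approach training-free.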

Two cautions keep this from being a free lunch. More signal is not monotonically better: chain-of-thought accuracy follows an inverted-U with length, and capable models actually prefer shorter chains (see "Why does chain of thought accuracy eventually decline with length?"); pushing thinking tokens from ~1,100 to ~16K dropped accuracy from 87% to 70% as the model overthought easy problems (see "Does more thinking time always improve reasoning accuracy?"). And the fluent reasoning you'd be amplifying can be hollow: chain-of-thought degrades predictably outside its training distribution, imitating the form of reasoning without valid underlying logic (see "Does chain-of-thought reasoning actually generalize beyond training data?"). So the honest answer: amplifying genuine information at the high-entropy forking points has real evidence behind it, but only if your signal tracks information gain rather than confidence-shaped surface correctness, and only up to the point where more becomes noise.

Sources (8 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.


Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Papers this line draws on (8)

The research behind the notes this line reads, ranked by how closely each paper relates.

When More is Less: Understanding Chain-of-Thought Length in LLMs (1.81 match, arXiv)
Break the Chain: Large Language Models Can be Shortcut Reasoners (1.77 match, arXiv)
Eliciting Reasoning in Language Models with Cognitive Tools (1.74 match, arXiv)
When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling (1.74 match, arXiv)
Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens (1.73 match, arXiv)
Mining Hidden Thoughts from Texts: Evaluating Continual Pretraining with Synthetic Data for LLM Reasoning (1.73 match, arXiv)
Understanding and Mitigating Premature Confidence for Better LLM Reasoning (1.73 match, arXiv)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (1.72 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems researcher re-testing whether LLM reasoning improves when reinforcement-learning training signals concentrate at high-entropy 'information peaks' rather than spreading uniformly across reasoning traces. A curated library (papers from Feb 2024 to Aug 2025) found strong support for this premise, but those claims are now months old. Treat them as dated constraints to validate, not current truth.

**What a curated library found — and when (dated claims, not current truth):**
Findings span Feb 2024–Aug 2025:
- Only ~20% of tokens in reasoning traces are high-entropy forking points; RL on that minority matches or exceeds training on all tokens (Jun 2025).
- Model confidence at answer spans works as an intrinsic reward proxy, improving step reasoning while restoring calibration damage from human feedback (Jul 2025).
- Supervised fine-tuning raises benchmark accuracy while cutting Information Gain by ~39%, hiding post-hoc rationalization via standard metrics (Jul 2025).
- Base models possess latent reasoning; post-training selects rather than builds it; verbose and concise reasoning occupy distinct activation-space directions steerable at inference time with no retraining (Jun–Jul 2025).
- Chain-of-thought accuracy follows an inverted-U with length; models prefer shorter chains; pushing from ~1.1K to ~16K tokens drops accuracy 87% → 70% (Feb–Apr 2025).
- CoT effectiveness degrades outside training distribution, imitating reasoning form without valid logic (Aug 2025).

**Anchor papers (verify; mind their dates):**
- arXiv:2506.01939 (Jun 2025) — High-Entropy Minority Tokens Drive RL
- arXiv:2507.04742 (Jul 2025) — Activation Steering for CoT Compression
- arXiv:2508.01191 (Aug 2025) — Is Chain-of-Thought a Mirage?
- arXiv:2507.21931 (Jul 2025) — Self-Feedback RL

**Your task:**
(1) RE-TEST EACH CONSTRAINT. For the 20% high-entropy claim, the confidence-reward proxy, and the inverted-U length effect: has newer post-training (Sept 2025–present), scaling, or better information-theoretic metrics since relaxed these limits? Distinguish the durable question (what information drives reasoning?) from perishable limitations (current training regimes). Cite what changed it.
(2) Surface the strongest CONTRADICTING work from the last 6 months. The Aug 2025 "Mirage" paper suggests CoT may be distribution-bound theater — does this undermine the peaks hypothesis, or does it sharpen the distinction between form and information gain?
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., (a) Can we identify information peaks without post-hoc entropy, using only base-model structure? (b) Does selective amplification at peaks remain beneficial if reasoning is largely latent elicitation, not creation?

**Guardrail:** Cite arXiv IDs; flag anything you cannot ground in a real paper.
