Can models possess latent reasoning capability that training signals fail to unlock?
This explores whether the ability to reason already lives inside a model's pretrained weights, waiting to be switched on — so that training is less about teaching reasoning than about finding the right key to unlock it.
The corpus comes down surprisingly hard on "yes, mostly." The strongest version of the claim is that base models already contain latent reasoning, and that five different techniques (reinforcement learning, critique fine-tuning, changes to decoding, steering of internal features, and RL with verifiable rewards, or RLVR) all reach into the *same* pre-existing capability rather than each creating new skill (see "Do base models already contain hidden reasoning ability?"). If five different keys open the same door, the door was already there. The bottleneck, on this view, is elicitation, not acquisition.
The most vivid evidence is how *little* signal it takes to unlock this capability. A single "reasoning feature" identified with a sparse autoencoder (SAE) can be steered directly to match or beat chain-of-thought prompting across six model families; it activates early in generation and even overrides surface instructions (see "Can we trigger reasoning without explicit chain-of-thought prompts?"). Large gains are possible with no RL at all: four modular "cognitive tools" lifted GPT-4.1 on AIME 2024 from 26.7% to 43.3% purely by isolating reasoning operations cleanly (see "Can modular cognitive tools unlock reasoning without training?"). And when reinforcement learning *is* applied, the dynamics suggest it sharpens sampling within existing capability boundaries rather than expanding them: one training example can suffice to activate the behavior, and even spurious rewards work nearly as well as correct ones for models with the right pretraining (see "What does reward learning actually do to model reasoning?"). The unsettling corollary appears in a separate thread: reasoning traces that are deliberately *corrupted* (systematically irrelevant to the problem) teach about as well as correct ones, implying the trace is computational scaffolding that triggers latent computation, not meaningful content the model learns from (see "Do reasoning traces need to be semantically correct?").
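The "sharpening within existing boundaries" claim can be made concrete with a toy calculation. The sketch below is an illustration only, not the RLVR algorithm: it assumes a model's behavior on one problem is a fixed distribution over candidate answers, models reward training as multiplying the rewarded answer's probability by a hypothetical `boost` factor, and compares single-sample accuracy with best-of-k accuracy. All names and numbers are made up for the example.

```python
def pass_at_k(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


def reweight(probs: list[float], rewarded_index: int, boost: float) -> list[float]:
    """Multiply one answer's probability by `boost`, then renormalize.

    A stand-in for reward training: it can only amplify mass that the
    base distribution already places on the rewarded answer.
    """
    scaled = [p * boost if i == rewarded_index else p for i, p in enumerate(probs)]
    total = sum(scaled)
    return [p / total for p in scaled]


if __name__ == "__main__":
    # Index 0 is the correct answer in both toy problems (assumed values).
    reachable = [0.05, 0.60, 0.35]    # base model is rarely right
    unreachable = [0.00, 0.60, 0.40]  # base model is never right

    for name, base in [("reachable", reachable), ("unreachable", unreachable)]:
        tuned = reweight(base, rewarded_index=0, boost=20.0)
        for k in (1, 64):
            print(
                f"{name:12s} k={k:<3d} "
                f"base={pass_at_k(base[0], k):.3f} tuned={pass_at_k(tuned[0], k):.3f}"
            )
```

In this toy setting the reweighted model is far better at k=1, but the base model catches up when many samples are drawn, and an answer with zero base probability stays at zero no matter how large the boost is.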
If you want to know *where* this latent capability comes from, the answer points back to pretraining itself: reasoning generalization is driven by broad, transferable procedural knowledge spread across many documents, unlike factual recall, which depends on narrow memorization of specific sources (see "Does procedural knowledge drive reasoning more than factual retrieval?"). That reframes the whole question: training signals don't "fail to unlock" capability so much as compete to access something pretraining already distributed widely. It also explains why confidence alone can serve as a reward that strengthens reasoning without any human labels or external verifier (see "Can model confidence work as a reward signal for reasoning?"): the model already knows enough to grade its own traces.
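To illustrate how confidence could stand in for a reward, here is a minimal sketch that ranks sampled traces by the model's confidence in their final answer span and turns the extremes into a preference pair. It is not the published RLSF procedure: the `Trace` type, the geometric-mean confidence score, and the best-versus-worst pairing are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Trace:
    text: str
    answer_logprobs: list[float]  # log-probs of the answer-span tokens only


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean probability of the answer-span tokens (assumed score)."""
    if not trace.answer_logprobs:
        raise ValueError("trace has no answer-span tokens")
    mean_logprob = sum(trace.answer_logprobs) / len(trace.answer_logprobs)
    return math.exp(mean_logprob)


def preference_pair(traces: list[Trace]) -> tuple[Trace, Trace]:
    """Return (chosen, rejected): the most and least confident traces."""
    if len(traces) < 2:
        raise ValueError("need at least two traces to form a pair")
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    return ranked[0], ranked[-1]


if __name__ == "__main__":
    # Hypothetical samples for one prompt; the log-probs are made up.
    samples = [
        Trace("step-by-step trace, answer 12", [-0.1, -0.2]),
        Trace("short guess, answer 15", [-1.5, -2.0]),
        Trace("partial trace, answer 12", [-0.7]),
    ]
    chosen, rejected = preference_pair(samples)
    print("chosen:  ", chosen.text)
    print("rejected:", rejected.text)
```

Pairs built this way could feed any preference-optimization method; no human label or external verifier enters the loop, which is the point the paragraph above makes.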
But the corpus doesn't let "it's all already there" off the hook. There is a ceiling on what gets unlocked. When semantic content is stripped from a task, model performance collapses even with the correct rules sitting in context; the latent capability is semantic association, not formal symbolic logic, so it can't escape its training distribution (see "Do large language models reason symbolically or semantically?"). Reasoning failures track instance-level *unfamiliarity*, not task complexity: models fit patterns from similar instances rather than learning a general algorithm, so a chain succeeds only if something like it was seen before (see "Do language models fail at reasoning due to complexity or novelty?"). So the honest synthesis is two-sided: training signals genuinely *under*-elicit a large reservoir of latent reasoning, but that reservoir is bounded by what pretraining made familiar. Unlocking is real; conjuring is not.
The forward edge of the corpus is about *managing* that latent capability rather than just triggering it. Three threads stand out: making latent reasoning stochastic so a model can hold uncertainty and explore multiple solution paths instead of committing early (see "Can stochastic latent reasoning help models explore multiple solutions?"); steering reasoning toward brevity by moving along a single direction in activation space with no retraining (see "Can we steer reasoning toward brevity without retraining?"); and teaching a model to route between thinking hard and answering fast (see "Can models learn when to think versus respond quickly?"). A pattern you may not have expected: across these papers, the lever that controls reasoning often turns out to be a single feature, a single direction, or a single example, which is exactly what you'd predict if the capability is already present and training is just choosing whether to express it.
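"Moving along a single direction in activation space" can be sketched in a few lines. The example below assumes a difference-of-means construction, which is a common way to build a steering vector but is not confirmed here as the method used by Activation-Steered Compression. Plain Python lists stand in for hidden-state activations, and `alpha` is a hypothetical steering strength.

```python
Vector = list[float]


def mean_vector(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(concise: list[Vector], verbose: list[Vector]) -> Vector:
    """Difference of means: points from verbose activations toward concise ones."""
    mean_concise = mean_vector(concise)
    mean_verbose = mean_vector(verbose)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Shift a hidden state along the steering direction; no weights change."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Toy 2-D activations for paired prompts (made-up numbers).
    concise_acts = [[1.0, 0.0], [3.0, 0.0]]
    verbose_acts = [[0.0, 2.0], [0.0, 4.0]]

    direction = steering_vector(concise_acts, verbose_acts)
    print("direction:", direction)
    print("steered:  ", steer([1.0, 1.0], direction, alpha=0.5))
```

In a real model the same addition would be applied to one layer's activations at inference time, which is why this kind of control needs no retraining.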
Sources (12 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
SAE-identified reasoning features can be directly steered to match or exceed chain-of-thought performance across six model families. This reasoning mode activates early in generation and overrides surface-level instructions, suggesting latent reasoning is a fundamental capability independent of explicit prompting.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.