INQUIRING LINE

What does pass@k reveal about base model reasoning capacity?

This explores what the pass@k metric — letting a model take many attempts and counting it a success if any one lands — tells us about whether reasoning lives inside base models already or gets installed by training.


This question is really asking whether reasoning is something a base model already has or something post-training builds from scratch, and pass@k is the lens that exposes the difference. The corpus comes down hard on the "already has it" side. The clearest statement is that base models contain latent reasoning capability that minimal training merely unlocks: five independent techniques (RL steering, critique fine-tuning, decoding tweaks, feature steering, and RLVR) all surface reasoning that was sitting in base-model activations the whole time (see "Do base models already contain hidden reasoning ability?"). The punchline for pass@k is that post-training *selects* reasoning rather than *creating* it. A base model sampled enough times often produces the correct chain on its own; what RLVR-style training does is raise the odds that the first sample is the good one. So a high pass@k paired with a low pass@1 for the same base model isn't a contradiction: it's evidence the capability was latent, and the bottleneck is elicitation, not acquisition.
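To make the pass@1 versus pass@k gap concrete, here is a minimal sketch of the commonly used unbiased pass@k estimator: draw n samples per problem, count the c correct ones, and compute the probability that a random subset of k samples contains at least one correct one. The function name `pass_at_k` and the example numbers are illustrative assumptions; the notes summarized here do not say which estimator they used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of them correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must include a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 100 samples, only 3 correct.
low_pass_at_1 = pass_at_k(n=100, c=3, k=1)    # 3 in 100
high_pass_at_50 = pass_at_k(n=100, c=3, k=50)  # 29 in 33, about 0.88
print(low_pass_at_1, high_pass_at_50)
```

In this made-up case the same model looks weak at k=1 and strong at k=50, which is the pattern the paragraph above reads as latent capability with an elicitation bottleneck.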

That reframes a lot of the apparent "reasoning ceilings" you read about elsewhere in the collection. Several notes argue that when reasoning models collapse, the failure isn't a missing capability but something narrower. One shows collapses are execution failures, not reasoning failures: a text-only model can know an algorithm yet be unable to grind through its steps, and giving it tools lets it sail past the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Another finds breakdowns track instance-level *unfamiliarity* rather than task complexity: models succeed on any chain resembling something they've seen and stumble on novel instance structures (see "Do language models fail at reasoning due to complexity or novelty?"). Both fit the pass@k picture: capacity exists but is unevenly retrievable, and a single sample undersells what's actually in there.

But the corpus won't let you read pass@k as proof of deep symbolic competence either, and this is the part you didn't know you wanted to know. If base models can produce correct reasoning across enough samples, *what* are they producing? The skeptical notes suggest a lot of it is fluent imitation of reasoning form rather than genuine inference: chain-of-thought reproduces familiar schemata and degrades predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?"). Models lean on semantic associations, not formal logic; strip the familiar semantics away and performance collapses even when the rules are handed to them (see "Do large language models reason symbolically or semantically?"). And reasoning traces themselves turn out to be unreliable witnesses: invalid logical steps drive nearly the same performance gains as valid ones (see "Do reasoning traces show how models actually think?").

Put those together and pass@k reveals something double-edged. It shows base models carry a large reservoir of latent, distribution-bounded reasoning that training elicits rather than invents — which is why a base model with enough tries can match a tuned one. But it also inflates how much *genuine* reasoning you'd attribute to the model, because some fraction of those winning samples are well-shaped imitations that happen to land. The reservoir is real; its contents are a mix of competence and convincing pattern-completion.

If you want to push on the boundary, two notes sharpen it. One shows frontier reasoning models (DeepSeek-R1 and o1-preview) reach only about 20 to 23.6% exact match on constraint-satisfaction problems demanding real backtracking, a ceiling pass@k can't sample its way past when the capability genuinely isn't latent (see "Can reasoning models actually sustain long-chain reflection?"). Another argues training regime, not inference budget, is what makes extra tokens productive, so simply cranking up k on a non-reasoning model won't close the gap with a model trained to reason (see "Can non-reasoning models catch up with more compute?"). The reservoir has a floor as well as a depth.
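The ceiling argument can be sketched numerically. Assuming samples are independent and each problem i has a fixed per-sample success rate p_i, expected benchmark pass@k is the mean of 1 - (1 - p_i)^k, which climbs toward the fraction of problems with p_i > 0 and no further. The function name `expected_pass_at_k` and the rates below are hypothetical illustrations, not figures from any of the notes.

```python
def expected_pass_at_k(success_rates: list[float], k: int) -> float:
    """Mean over problems of 1 - (1 - p)^k, assuming independent samples."""
    if not success_rates:
        raise ValueError("success_rates must be non-empty")
    return sum(1.0 - (1.0 - p) ** k for p in success_rates) / len(success_rates)


# Hypothetical benchmark of five problems: three where the capability is
# latent (p > 0, however small) and two where it is absent (p == 0).
rates = [0.5, 0.05, 0.01, 0.0, 0.0]

# Fraction of problems that any amount of sampling could ever solve.
ceiling = sum(p > 0 for p in rates) / len(rates)

for k in (1, 10, 100, 1000):
    print(k, round(expected_pass_at_k(rates, k), 3), "ceiling:", ceiling)
```

Under these assumptions, raising k recovers the rare-but-present solutions and leaves the zero-rate problems untouched, so the curve flattens at 3 of 5 problems no matter how large k gets.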


Sources (9 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Can reasoning models actually sustain long-chain reflection?

DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capabilities analyst. The question: Does pass@k measure genuine latent reasoning capacity in base models, or does it conflate latent competence with fluent imitation?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A curated library reported:
• Base models contain latent reasoning capability that post-training *selects* rather than creates; high pass@k + low pass@1 signals capacity was there, not that training built it (2024–2026).
• Reasoning collapses are execution failures (tooling, distribution shift, instance-level unfamiliarity) not missing capability — pass@k undersells what's retrievable (2024–2026).
• Chain-of-thought reproduces familiar schemata and is distribution-bounded; models are semantic reasoners leaning on associations, not symbolic logic (2023–2025).
• ~20–23% ceiling on constraint-satisfaction tasks demanding true backtracking; pass@k cannot sample past absent capability (2026).
• Training regime, not inference budget, determines whether extra tokens help; cranking k on non-reasoning models won't close the gap (2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023-05): In-Context Semantic Reasoners rather than Symbolic
• arXiv:2508.01191 (2025-08): Is Chain-of-Thought Reasoning a Mirage? Data Distribution Lens
• arXiv:2604.15726 (2026-04): LLM Reasoning Is Latent, Not the Chain of Thought
• arXiv:2602.06176 (2026-02): Large Language Model Reasoning Failures

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above (latent-vs-created, distribution-boundedness, semantic-vs-symbolic, backtracking ceilings, training-regime primacy), judge whether newer decoder architectures (o1-style process reward models, test-time scaling beyond pass@k), training methods (constitutional AI, mixture-of-experts reasoning), or evaluations (grounded symbolic benchmarks, mechanistic probes) have since relaxed or overturned it. Separate the durable question (does pass@k reveal true capacity?) from the perishable limitation (e.g., current ceiling on backtracking). Cite what resolved or confirmed each constraint, plainly.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months that either defends pass@k as a faithful capacity signal or abandons it entirely in favor of a new metric.
(3) Propose 2 research questions that assume the regime may have moved: e.g., (a) Do reasoning-specialized models (post-2025) show latent capacity on symbolic tasks that pass@k cannot access even at high k? (b) Does mechanistic interpretability reveal whether high-pass@k samples reflect learned search or stochastic luck?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
