INQUIRING LINE

Why do open-source models trained on proprietary outputs still fail at reasoning?

This explores why distilling a proprietary model's outputs into an open-source model — copying its answers and reasoning traces — tends to transfer the look of reasoning without the capability, and what the corpus says is actually missing.


The corpus points to a single uncomfortable answer: when you train on another model's outputs, you're copying the surface form of reasoning, and the surface form turns out not to be where the reasoning lives.

The most direct evidence is that chain-of-thought traces (the exact thing you'd copy from a proprietary teacher) are imitation, not inference. Models trained on them reproduce familiar reasoning *schemata* and then break predictably the moment a task drifts outside the training distribution (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "Does chain-of-thought reasoning actually generalize beyond training data?"). Stranger still, traces don't even need to be *correct* to teach. Models trained on deliberately corrupted, irrelevant reasoning steps perform comparably to ones trained on clean traces, which means the trace is functioning as computational scaffolding, not as transmitted meaning (see "Do reasoning traces need to be semantically correct?"). If garbage traces work as well as good ones, then copying a teacher's *good* traces was never the lever you thought it was.
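One way to picture the corrupted-trace control is as a dataset transform that keeps each question and final answer but swaps in the trace from a different problem. The sketch below is illustrative only: the `Example` type, its field names, and the rotate-by-`shift` scheme are assumptions made for this example, not the procedure used in the cited work.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Example:
    question: str
    trace: str   # intermediate reasoning text (hypothetical field name)
    answer: str  # final answer, left untouched by the transform


def with_irrelevant_traces(examples: List[Example], shift: int = 1) -> List[Example]:
    """Give every example the trace that belongs to a different position.

    Rotating traces by `shift` positions means no example keeps the trace
    from its own position, as long as shift is not a multiple of len(examples).
    """
    n = len(examples)
    if n < 2 or shift % n == 0:
        raise ValueError("need at least 2 examples and a non-trivial shift")
    return [
        replace(ex, trace=examples[(i + shift) % n].trace)
        for i, ex in enumerate(examples)
    ]


# Toy data, invented for illustration.
clean = [
    Example("2+3?", "2 plus 3 is 5", "5"),
    Example("4*2?", "4 times 2 is 8", "8"),
    Example("9-1?", "9 minus 1 is 8", "8"),
]
corrupted = with_irrelevant_traces(clean)
```

Training one model on `clean` and another on `corrupted`, then comparing final-answer accuracy, is the shape of the comparison described above; if the two match, the trace content was not what carried the signal.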

So where does reasoning actually come from? Two threads converge on the same place: the base model and the training regime, not the data you imitate. Base models already contain latent reasoning that minimal post-training merely *elicits*. Five independent methods all unlock capability that was already present, suggesting post-training selects rather than creates (see "Do base models already contain hidden reasoning ability?"). And reasoning models keep beating non-reasoning models no matter how much inference compute you throw at the weaker one, because the training regime installs a protocol that makes extra tokens productive (see "Can non-reasoning models catch up with more compute?"). An open-source model fed proprietary outputs inherits the tokens but not the protocol, and if its base lacks the latent capability, no amount of imitation conjures it.

The failure also isn't where you'd expect. Reasoning breakdowns track *instance-level unfamiliarity*, not task complexity: models fit patterns around specific instances rather than learning general algorithms, so any chain succeeds only if something similar was in training (see "Do language models fail at reasoning due to complexity or novelty?"). This is the deep reason distillation disappoints: you're transferring a teacher's instance coverage, which is brittle by construction. Decouple the semantics from the logic and performance collapses even with correct rules in hand, because these models reason through learned token associations, not symbolic manipulation (see "Do large language models reason symbolically or semantically?"). Some apparent 'reasoning collapses' are even execution-bandwidth failures, where the model knows the algorithm but can't run it across enough steps in text alone (see "Are reasoning model collapses really failures of reasoning?").
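To make "decouple the semantics from the logic" concrete, here is a minimal sketch that rewrites a toy deduction task so the rule structure survives but the familiar words do not. The function name `decouple_semantics`, the `X0, X1, ...` symbol scheme, and the example task are all assumptions for illustration; the cited work may construct its symbolic variants differently.

```python
import re
from typing import Dict, List


def decouple_semantics(text: str, vocabulary: List[str]) -> str:
    """Replace each listed term with an opaque symbol, consistently."""
    mapping: Dict[str, str] = {term: f"X{i}" for i, term in enumerate(vocabulary)}
    # Whole-word, case-sensitive matching; longer terms are tried first.
    alternatives = "|".join(
        re.escape(t) for t in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


task = "If something is a bird then it can fly. Tweety is a bird. Can Tweety fly?"
symbolic = decouple_semantics(task, ["bird", "fly", "Tweety"])
print(symbolic)
# prints: If something is a X0 then it can X1. X2 is a X0. Can X2 X1?
```

Both versions need the same single deduction step. A system doing symbolic manipulation should be indifferent to the rewrite, whereas one leaning on learned token associations loses the cues it was relying on.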

Here's the part you didn't know you wanted: the methods that *do* transfer capability into small models don't imitate outputs at all. They train on the structure of right-versus-wrong. DPO (direct preference optimization) on a teacher's correct *and incorrect* examples lets small models match large ones, precisely because the negative examples target failures that plain imitation via SFT (supervised fine-tuning) can't fix (see "Can small models match large models on function calling?"). And Quiet-STaR grows reasoning as a side effect of predicting *any* text, judging rationales by whether they improve prediction rather than by copying a label (see "Can models learn reasoning from predicting any text?"). The lesson across both: reasoning is installed by a training process that rewards what works, not by ingesting a smarter model's transcripts.
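The contrast between imitation and right-versus-wrong training shows up directly in the loss functions. The sketch below compares a plain SFT loss with the standard DPO objective on one (chosen, rejected) pair, using scalar sequence log-probabilities. The function names, the `beta` value, and the numbers are assumptions for illustration, not taken from the cited function-calling work.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def sft_loss(policy_logp_chosen: float) -> float:
    """Imitation: only the teacher's good output enters the loss."""
    return -policy_logp_chosen


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO: reward preferring chosen over rejected, relative to a frozen reference."""
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Made-up log-probabilities. The policy starts equal to the reference.
start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Pushing the rejected (incorrect) output down lowers the DPO loss...
after = dpo_loss(-5.0, -9.0, -5.0, -5.0)
# ...while the SFT loss never sees the rejected output at all.
sft = sft_loss(-5.0)
```

With the policy equal to the reference, the DPO loss is `log 2`. It falls as the model separates the correct output from the incorrect one, which is a signal that the SFT term cannot express.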


Sources (10 notes)

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when similar instances appeared in training.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can models learn reasoning from predicting any text?

Quiet-STaR trains language models to generate rationales at every token position during pretraining on arbitrary internet text, enabling general reasoning without task-specific datasets. Rationale quality is judged by predictive accuracy rather than labeled correctness, allowing reasoning competence to emerge as a side effect of improved language modeling.
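The idea in the Quiet-STaR note, that a rationale is judged by predictive accuracy rather than by a label, can be sketched as a reward computation. This is a simplified illustration under assumptions: the function name and the mean-baseline form are chosen for this example, and the full method involves further details that are omitted here.

```python
from statistics import mean
from typing import List


def rationale_rewards(future_logps: List[float]) -> List[float]:
    """Score each sampled rationale by how much it helps predict the real text.

    future_logps[j] is the log-likelihood of the true upcoming tokens when
    the model conditions on rationale j. Subtracting the mean across the
    sampled rationales gives a relative reward: positive if a rationale
    helped more than average, negative if it helped less.
    """
    baseline = mean(future_logps)
    return [lp - baseline for lp in future_logps]


# Made-up log-likelihoods for three rationales sampled at one token position.
rewards = rationale_rewards([-2.0, -1.0, -3.0])
print(rewards)  # prints: [0.0, 1.0, -1.0]
```

No reference answer or teacher transcript appears anywhere in the reward. The only supervision is whether the text that actually follows became easier to predict.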

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM capability analyst. The question remains live: Why do open-source models trained on proprietary outputs still fail at reasoning?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library surfaced these constraints:
  • Chain-of-thought traces function as imitation scaffolding, not genuine inference; models trained on corrupted traces perform comparably to clean ones, suggesting the trace's *form* matters more than its correctness (2024–2025).
  • Reasoning capability lives in the base model and training regime, not in copied outputs; five independent methods show post-training *elicits* latent reasoning already present (2024–2025).
  • Reasoning breakdowns track instance-level unfamiliarity, not task complexity; distillation transfers brittle instance coverage, not generalizable algorithms (2025–2026).
  • LLMs reason through learned token associations, not symbolic manipulation; decoupling semantics from logic collapses performance even with correct rules available (2023–2025).
  • Methods that *do* transfer capability (DPO on correct + incorrect examples, Quiet-STaR) reward what works, bypassing imitation entirely (2024–2025).

Anchor papers (verify; mind their dates):
  • arXiv:2305.14825 (2023-05): In-Context Semantic Reasoners rather than Symbolic Reasoners
  • arXiv:2410.18890 (2024-10): DPO-trained small models matching large ones via contrastive examples
  • arXiv:2508.01191 (2025-08): CoT as distribution-bounded imitation, not true reasoning
  • arXiv:2602.06176 (2026-02): LLM reasoning failures (latest synthesis)

Your task:
  (1) RE-TEST EACH CONSTRAINT. Has the distinction between "imitation scaffolding" and "genuine reasoning" held as models and training methods evolved since mid-2025? Do newer base models or post-training procedures (e.g., newer variants of DPO, synthetic reasoning-step generation, or hybrid distillation) now transfer reasoning *capability* rather than form? Where does the constraint still bite hardest?
  (2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper claimed successful reasoning transfer via distillation, or reported that instance-level overfitting is *not* the bottleneck? Flag disagreement explicitly.
  (3) Propose 2 research questions that assume the regime may have moved: (a) If reasoning is indeed latent in base models, how should distillation be re-designed to avoid copying surface form? (b) Can a hybrid approach—combining negative examples (DPO-style) with structural constraints on reasoning tokens—outperform pure imitation *and* pure preference learning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
