Can reasoning improvements be attributed when optimizer and scaffold are unknown?
This explores whether you can credit a measured reasoning gain to the actual method when you don't know what's doing the work — the training optimizer (SFT, RL) versus the inference-time scaffold (prompts, decoding tricks, abstractions) — and the corpus says attribution is genuinely hard because the headline metric often hides where the gain came from.
The corpus's blunt answer: not reliably. The standard signal used for attribution is final-answer accuracy, and accuracy turns out to be a leaky proxy for reasoning. The sharpest version is the SFT accuracy trap: fine-tuning raises benchmark scores while *cutting* Information Gain, a per-step measure of reasoning quality, by 38.9 percent, so the model arrives at correct answers through post-hoc rationalization, not better inference (see "Does supervised fine-tuning improve reasoning or just answers?"). A nearby result shows fine-tuning also weakens the causal link between the reasoning chain and the answer: cut the chain short, paraphrase it, or stuff it with filler, and the answer stays the same more often than it did before fine-tuning; the reasoning has become performative (see "Does fine-tuning disconnect reasoning steps from final answers?"). If your only instrument is accuracy, you can't see any of this happening.
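A faithfulness probe of the kind described above can be sketched in a few lines. This is a minimal illustration, not the cited note's protocol: the `Answerer` interface, the two toy "models", and the example data are all assumptions made up for the sketch, and the paraphrase test is omitted because it needs a real paraphrasing model.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning_chain) -> final answer.
Answerer = Callable[[str, str], str]


def truncate(chain: str, keep: float = 0.5) -> str:
    """Early termination: keep only the first `keep` fraction of the steps."""
    steps = chain.split("\n")
    return "\n".join(steps[: max(1, int(len(steps) * keep))])


def filler(chain: str) -> str:
    """Filler substitution: replace every whitespace-separated token with '...'."""
    return " ".join("..." for _ in chain.split())


def invariance_rate(
    answer_fn: Answerer,
    examples: List[Tuple[str, str]],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.

    A high rate suggests the chain is not causally driving the answer.
    """
    unchanged = sum(
        answer_fn(q, chain) == answer_fn(q, perturb(chain)) for q, chain in examples
    )
    return unchanged / len(examples)


# Two toy stand-ins for a model (assumptions, not real systems).
def chain_reader(question: str, chain: str) -> str:
    """Answers with whatever the last step of the chain says."""
    return chain.split("\n")[-1].strip()


_MEMORIZED = {"2+3": "5", "4*2": "8"}


def chain_ignorer(question: str, chain: str) -> str:
    """Answers from a lookup table and never reads the chain."""
    return _MEMORIZED[question]


EXAMPLES = [("2+3", "add 2 and 3\n5"), ("4*2", "double 4\n8")]

if __name__ == "__main__":
    for name, fn in [("chain_reader", chain_reader), ("chain_ignorer", chain_ignorer)]:
        print(name, invariance_rate(fn, EXAMPLES, truncate), invariance_rate(fn, EXAMPLES, filler))
```

Both toy models score 100 percent accuracy on the unperturbed examples, which is the point: only the invariance rate separates the one whose answer depends on its chain from the one whose chain is decoration.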
The deeper problem is that the optimizer often isn't creating the capability you're crediting it with; it's just *exposing* something the base model already had. One striking line of work argues RL post-training teaches a model *when* to reason, not *how*: base models already carry reasoning strategies in latent form, hybrid models recover 91 percent of the gains by routing tokens alone, and the strategy-activation vectors exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?"). Push on the same question from the generalization side and you find RL fine-tuning sharpening memorization rather than installing procedures: GRPO-trained models show sharp performance drops on slightly out-of-distribution variants, revealing template-matching dressed up as problem-solving (see "Do fine-tuned language models actually learn optimization procedures?"). So a gain you attribute to 'the optimizer taught reasoning' may really be 'the optimizer learned the test set's shape.'
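The out-of-distribution check can be made concrete as a gap between in-distribution and perturbed-variant accuracy. The sketch below is illustrative only: the two solvers, the problem encoding, and the "change one number" perturbation are hypothetical stand-ins, not the N-1 test sets from the cited note.

```python
from typing import Callable, List, Tuple

Problem = Tuple[int, ...]          # hypothetical encoding: a tuple of numbers to sum
Solver = Callable[[Problem], int]


def accuracy(solver: Solver, problems: List[Problem]) -> float:
    """Share of problems where the solver returns the true sum."""
    return sum(solver(p) == sum(p) for p in problems) / len(problems)


def generalization_gap(solver: Solver, in_dist: List[Problem], ood: List[Problem]) -> float:
    """In-distribution accuracy minus out-of-distribution accuracy."""
    return accuracy(solver, in_dist) - accuracy(solver, ood)


TRAIN = [(1, 2, 3), (4, 5, 6), (10, 20, 30)]
# Slight variants: the last number of each training problem is bumped by one.
VARIANTS = [p[:-1] + (p[-1] + 1,) for p in TRAIN]

# Template matcher: a lookup table over the training problems (assumption).
_TABLE = {p: sum(p) for p in TRAIN}


def template_matcher(problem: Problem) -> int:
    # Falls back to the answer of the training problem sharing the same prefix.
    for seen, answer in _TABLE.items():
        if seen[:-1] == problem[:-1]:
            return answer
    return 0


def procedure(problem: Problem) -> int:
    """Actually carries out the computation."""
    return sum(problem)


if __name__ == "__main__":
    print("template_matcher gap:", generalization_gap(template_matcher, TRAIN, VARIANTS))
    print("procedure gap:", generalization_gap(procedure, TRAIN, VARIANTS))
```

Both solvers are perfect on the training distribution, so in-distribution accuracy cannot distinguish them; the gap is what carries the attribution signal.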
The scaffold side is just as slippery, and it cuts the comforting story the other way. Decoding-level interventions such as a thought-switching penalty, with no fine-tuning at all, improve accuracy by stopping models from abandoning good paths prematurely, which means the capability was already latent and the *scaffold* unlocked it (see "Why do reasoning models abandon promising solution paths?"). Allocating test-time compute to diverse abstractions beats sampling more solutions at large budgets, so the gain lives in the exploration structure, not the weights (see "Can abstractions guide exploration better than depth alone?"). And most unsettling for attribution: models trained on systematically irrelevant reasoning traces keep their solution accuracy, and post-conclusion 'reasoning' that runs after the answer is already settled actively *harms* learning (see "Do reasoning traces need to be semantically correct?" and "Does every correct chain-of-thought trace improve fine-tuning?"). If garbled traces work and tidy ones can hurt, then the visible reasoning text, the thing you'd naturally point to as the mechanism, is computational scaffolding, not the explanation.
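To make "decoding-level intervention" concrete, here is one way a thought-switching penalty could look at a single decoding step. This is a sketch under stated assumptions: the set of switch tokens, the `penalty` strength, and the `window` length are hypothetical parameters chosen for illustration, and the cited note does not specify this exact formulation.

```python
from typing import Dict, Set

# Hypothetical markers that tend to open a new line of thought.
SWITCH_TOKENS: Set[str] = {"Alternatively", "Wait"}


def apply_switch_penalty(
    logits: Dict[str, float],
    steps_since_switch: int,
    switch_tokens: Set[str] = SWITCH_TOKENS,
    penalty: float = 3.0,
    window: int = 4,
) -> Dict[str, float]:
    """Lower the logits of switch tokens while the current thought is young.

    Within `window` steps of the last switch, each switch token's logit is
    reduced by `penalty`; afterwards the logits pass through unchanged.
    """
    if steps_since_switch >= window:
        return dict(logits)
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Return the highest-scoring token."""
    return max(logits, key=logits.get)


if __name__ == "__main__":
    step_logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
    # Early in a thought the switch is suppressed; later it is allowed again.
    print(greedy_pick(apply_switch_penalty(step_logits, steps_since_switch=1)))
    print(greedy_pick(apply_switch_penalty(step_logits, steps_since_switch=10)))
```

Nothing about the model's weights changes here, which is why a gain from this kind of intervention has to be credited to the scaffold rather than the optimizer.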
Put the two unknowns together and you get a genuine identifiability problem. The same accuracy bump is consistent with at least four stories: the optimizer installing a skill, the optimizer memorizing templates, the scaffold unlocking a latent skill, or surface formatting improvements that never touched feasibility at all. The last case is fine-tuning that makes optimization answers *look* valid (clean JSON, right sections) without making them physically feasible (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). Worse, on hard constrained tasks LLMs plateau at roughly 55–60 percent constraint satisfaction regardless of scale, architecture, or training regime, and reasoning variants don't systematically beat standard ones; extended chains produce more text, not more computation (see "Do larger language models solve constrained optimization better?" and "Do reasoning models actually beat standard models on optimization?"). When the ceiling is fixed, attributing a move underneath it to your favorite lever is mostly noise.
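The difference between an answer that looks valid and one that is feasible is easy to show by scoring the two separately. The sketch below assumes a toy dispatch problem; the JSON schema, generator names, limits, and demand figure are all invented for illustration and do not come from the cited notes.

```python
import json
from typing import Dict, Iterable


def is_well_formed(raw: str, required_keys: Iterable[str]) -> bool:
    """Surface check: parses as a JSON object and has the expected sections."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)


def is_feasible(
    dispatch: Dict[str, float],
    limits: Dict[str, float],
    demand: float,
    tol: float = 1e-6,
) -> bool:
    """Substance check: every unit within its limit and supply meets demand."""
    within_limits = all(
        g in dispatch and 0 <= dispatch[g] <= limits[g] for g in limits
    )
    balanced = abs(sum(dispatch.values()) - demand) <= tol
    return within_limits and balanced


# Hypothetical problem data.
LIMITS = {"g1": 60.0, "g2": 100.0}
DEMAND = 130.0

# Clean JSON, right sections, sums to demand, but g1 exceeds its limit.
LOOKS_VALID = '{"dispatch": {"g1": 80, "g2": 50}, "notes": "balanced"}'
# Same format, and the constraints actually hold.
IS_VALID = '{"dispatch": {"g1": 60, "g2": 70}, "notes": "balanced"}'

if __name__ == "__main__":
    for raw in (LOOKS_VALID, IS_VALID):
        formatted = is_well_formed(raw, ["dispatch", "notes"])
        feasible = is_feasible(json.loads(raw)["dispatch"], LIMITS, DEMAND)
        print(f"well-formed={formatted} feasible={feasible}")
```

Reporting the two rates side by side is what reveals whether a training run moved formatting, feasibility, or both.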
What the corpus implies you actually need is a different instrument. The escape route running underneath several of these notes is to stop measuring only outcomes and start measuring the reasoning directly: Information Gain per step, faithfulness probes that perturb the chain and watch the answer, and out-of-distribution stress tests that separate procedure from memorization. The one place attribution becomes clean is empirical isolation by construction: the Darwin Gödel Machine can credit each self-modification because it keeps an archive and benchmarks every variant against its ancestors, turning attribution into a controlled experiment rather than a guess (see "Can AI systems improve themselves through trial and error?"). The upshot: 'did reasoning improve?' is barely answerable from a score. You have to ablate the optimizer and the scaffold against each other, because accuracy alone can't tell you which one (if either) did the work.
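Ablating the optimizer and the scaffold against each other amounts to a 2×2 factorial design. The sketch below shows the arithmetic only; the function name, the dictionary layout, and the scores are made-up assumptions, and the same calculation applies to any per-condition metric (accuracy, a faithfulness rate, an out-of-distribution gap).

```python
from typing import Dict, Tuple

# Key: (optimizer_applied, scaffold_applied) -> score in percentage points.
Scores = Dict[Tuple[bool, bool], float]


def factorial_effects(scores: Scores) -> Dict[str, float]:
    """Decompose a 2x2 ablation into per-lever effects and their interaction.

    Effects are measured relative to the condition with neither lever applied.
    A nonzero interaction means the levers do not simply add up, so the
    combined gain cannot be split cleanly between them.
    """
    base = scores[(False, False)]
    optimizer_only = scores[(True, False)] - base
    scaffold_only = scores[(False, True)] - base
    combined = scores[(True, True)] - base
    return {
        "optimizer_only": optimizer_only,
        "scaffold_only": scaffold_only,
        "combined": combined,
        "interaction": combined - optimizer_only - scaffold_only,
    }


# Made-up scores for illustration only.
EXAMPLE_SCORES: Scores = {
    (False, False): 40,  # base model, plain decoding
    (True, False): 42,   # fine-tuned model, plain decoding
    (False, True): 55,   # base model, with scaffold
    (True, True): 56,    # fine-tuned model, with scaffold
}

if __name__ == "__main__":
    for name, value in factorial_effects(EXAMPLE_SCORES).items():
        print(f"{name}: {value:+}")
```

In this invented example, a report that only compared the first and last conditions would show a 16-point gain and invite crediting the optimizer, while the ablation shows the scaffold alone accounts for 15 of those points.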
Sources (12 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Post-conclusion reasoning, where the model keeps exploring after it has sufficient evidence for the answer, degrades supervised fine-tuning despite preserving correctness. Removing only this tail improves learning more than removing equally long random suffixes, indicating the harm comes from unnecessary exploration, not length.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.
DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.