Can reasoning in free text then formatting separately recover performance?
This explores whether separating the act of reasoning (in unconstrained free text) from the act of producing formatted output recovers accuracy that formatting constraints otherwise destroy.
The corpus suggests the answer is largely yes, and it explains *why* the gain shows up. The sharpest evidence is that format compliance actively suppresses reasoning when the two are entangled: models trained to hide their reasoning compute the correct answer in their early layers, then *suppress* those representations in later layers to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). The reasoning was there, and it remains recoverable from lower-ranked token predictions; the formatting demand pushed it out of the output. That is a direct mechanistic case for keeping the two stages apart: once a single pass no longer has to both think and conform, the answer is much less likely to get clobbered.
Why formatting interferes at all is its own thread. Format isn't a neutral wrapper: it steers the reasoning strategy itself, about 7.5× more strongly than the subject matter does. Training on multiple-choice data pushes models into shallow breadth-first scanning, while free-form training produces deeper depth-first reasoning (see "Does training data format shape reasoning strategy more than domain?" and "What makes chain-of-thought reasoning actually work?"). So free-text reasoning isn't just unconstrained space; it appears to unlock a qualitatively different (and often better) reasoning mode that a rigid output schema would have shut down. Separating the stages lets you have the depth-first thinking *and* the clean final format.
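The separation described above can be sketched as two independent model calls. This is a minimal illustration under stated assumptions: `generate` is a hypothetical stand-in for whatever text-generation call is available, the prompt wording is invented for the example, and the `{"answer": ...}` schema is arbitrary. None of these come from the cited notes.

```python
import json
from typing import Callable

# Hypothetical prompts; the wording is illustrative only.
REASON_PROMPT = (
    "Think through the problem step by step in plain prose.\n\n"
    "Problem: {question}"
)
FORMAT_PROMPT = (
    "Below is a worked solution. Copy its final answer into JSON of the form "
    '{{"answer": <string>}} and output nothing else.\n\nSolution:\n{reasoning}'
)


def reason_then_format(question: str, generate: Callable[[str], str]) -> dict:
    """Run two separate calls: unconstrained reasoning, then formatting."""
    # Stage 1: the prompt mentions no schema or output format at all.
    reasoning = generate(REASON_PROMPT.format(question=question))
    # Stage 2: the formatter only sees the finished reasoning.
    raw = generate(FORMAT_PROMPT.format(reasoning=reasoning))
    result = json.loads(raw)
    result["reasoning"] = reasoning  # keep the trace for inspection
    return result


def fake_generate(prompt: str) -> str:
    """Offline stand-in for a model call (assumption, for demonstration)."""
    if prompt.startswith("Think through"):
        return "17 + 25 = 42, so the answer is 42."
    return '{"answer": "42"}'


if __name__ == "__main__":
    print(reason_then_format("What is 17 + 25?", fake_generate))
```

The design choice worth noting is that the first prompt never mentions the output schema, so the reasoning call cannot be shaped by it. Whether this recovers accuracy for a given model and task is an empirical question the sketch does not answer.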
There's a complementary angle worth knowing: for small models, the failure under formatting pressure is often specifically a *format* failure, not a reasoning one. Models fine-tuned with DPO on correct-vs-incorrect function-calling pairs beat plain supervised fine-tuning precisely because the negative examples target rigid output-format mistakes that the model's underlying logic would otherwise get right (see "Can small models match large models on function calling?"). This reframes what 'recover performance' means: sometimes you're not recovering reasoning capability, you're rescuing a correct answer from a formatting stumble, which is exactly the case where decoupling helps most.
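To make the idea of a format-targeted negative example concrete, here is a small sketch of how one such preference pair might be assembled. Everything in it is an assumption for illustration: the JSON call shape with `name` and `arguments`, the `prompt`/`chosen`/`rejected` record layout, and both helper functions are hypothetical and are not described in the cited note.

```python
import json
from typing import Optional


def is_well_formed_call(text: str, known_functions: set[str]) -> bool:
    """Check format only: valid JSON with a known 'name' and dict 'arguments'."""
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(call, dict)
        and call.get("name") in known_functions
        and isinstance(call.get("arguments"), dict)
    )


def make_preference_pair(
    prompt: str,
    candidate_a: str,
    candidate_b: str,
    known_functions: set[str],
) -> Optional[dict]:
    """Return a prompt/chosen/rejected record, or None if there is no format contrast."""
    a_ok = is_well_formed_call(candidate_a, known_functions)
    b_ok = is_well_formed_call(candidate_b, known_functions)
    if a_ok == b_ok:
        return None  # both valid or both invalid: no format signal here
    if a_ok:
        chosen, rejected = candidate_a, candidate_b
    else:
        chosen, rejected = candidate_b, candidate_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    good = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
    bad = "add(a=2, b=3)"  # right logic, wrong output format
    print(make_preference_pair("Add 2 and 3.", bad, good, {"add"}))
```

In this toy pair the rejected candidate carries the same intent as the chosen one and differs only in format, which is the kind of contrast the paragraph above attributes to the negative examples.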
The caveat the corpus raises: free reasoning space isn't free. Structured templates (explicit premises, code-path traces, evidence checks) beat unstructured free-form thinking on reliability, lifting patch-equivalence accuracy from 78% to 88% by catching cases, such as function shadowing, that free-form reasoning missed (see "Can structured templates make code reasoning more reliable than free-form thinking?"). And unconstrained reasoning can crowd out the context an agent needs for later steps (see "Does limiting reasoning per turn improve multi-turn search quality?"). So the win isn't 'free text always beats structure'; it's that the *output formatting* constraint and the *reasoning* process shouldn't fight over the same tokens. You can even compress the free reasoning afterward without losing accuracy, since most chain-of-thought tokens serve style and documentation, not computation (see "Can minimal reasoning chains match full explanations?"). The surprising takeaway: formatting demands can themselves be the cause of failure, and the cheapest fix is often just to do the thinking somewhere the formatter can't reach.
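The context-crowding caveat can be made concrete with a small budget tracker. This is a rough sketch, not the method from the cited note: the `TurnBudget` class, its field names, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TurnBudget:
    context_limit: int        # total "tokens" the agent may hold (assumed)
    per_turn_reasoning: int   # cap on reasoning "tokens" kept per turn (assumed)
    used: int = 0
    history: list = field(default_factory=list)

    def add_turn(self, reasoning: str, evidence: str) -> bool:
        """Store capped reasoning plus evidence; return False if it does not fit."""
        # Whitespace splitting is a crude proxy for a real tokenizer.
        kept = reasoning.split()[: self.per_turn_reasoning]
        cost = len(kept) + len(evidence.split())
        if self.used + cost > self.context_limit:
            return False
        self.used += cost
        self.history.append((" ".join(kept), evidence))
        return True


if __name__ == "__main__":
    budget = TurnBudget(context_limit=20, per_turn_reasoning=5)
    long_reasoning = "step " * 50
    for doc in ("doc one text", "doc two text", "doc three text"):
        print(doc, "->", budget.add_turn(long_reasoning, doc))
```

With the per-turn cap in place, two rounds of evidence fit in the toy 20-token window; without it, the first 50-token reasoning trace alone would already exceed the limit.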
Sources (7 notes)
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Models trained on multiple-choice data adopt breadth-first exploration (Cohen's d up to 1.5), while free-form training produces depth-first reasoning. Format effect dwarfs domain effect, meaning presentation matters far more than content type.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Semi-formal templates requiring explicit premises, code-path traces, and evidence checks improved patch equivalence accuracy from 78% to 88%, catching cases like function shadowing that free-form reasoning missed. Templates act as completeness certificates without formal verification.
Unrestricted reasoning within single search turns consumes context needed for subsequent retrieval rounds, degrading the agent's ability to incorporate new evidence. Setting per-turn reasoning budgets, not just overall time limits, prevents this context erosion and maintains search quality across iterations.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.