Can small models reason well by just learning output format?
Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
The Tina paper trains a 1.5B-parameter model with LoRA (low-rank adaptation) during RL post-training: the base model weights stay frozen and only the LoRA modules are updated. The resulting model achieves reasoning performance competitive with, and sometimes surpassing, full-parameter RL reasoning models trained on the same base, despite using a tiny fraction of the post-training compute.
The authors explain LoRA's effectiveness with the Rapid Reasoning Format Adaptation Hypothesis: what RL post-training primarily teaches a small model is not new knowledge about the world but how to organize its outputs in a reasoning-trace format. LoRA, which adds only a low-rank update to each adapted weight matrix, is sufficient to adapt the output format while the base model's pre-existing knowledge remains intact.
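The low-rank mechanism can be made concrete with a minimal NumPy sketch. This is not the Tina training code: the class name `LoRALinear`, the layer sizes, the rank `r`, and the scaling `alpha / r` are illustrative assumptions following the common LoRA formulation. It shows two properties the hypothesis relies on: the frozen weight is never modified, and the learned change to the layer has rank at most `r`.

```python
import numpy as np


class LoRALinear:
    """Hypothetical layer: frozen weight W plus a trainable low-rank update."""

    def __init__(self, d_in, d_out, r=4, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(d_out, d_in))  # frozen base weight
        self.A = rng.normal(0.0, 0.02, size=(r, d_in))      # trainable factor
        self.B = np.zeros((d_out, r))                        # trainable, zero at init
        self.scale = alpha / r                               # assumed scaling convention

    def delta(self):
        # Effective change to the weight matrix; its rank cannot exceed r.
        return self.scale * (self.B @ self.A)

    def forward(self, x):
        # Base path uses W unchanged; the adapter path adds the low-rank term.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T


layer = LoRALinear(d_in=64, d_out=64, r=4)
x = np.ones((2, 64))
base_out = x @ layer.W.T

# With B initialised to zero, the adapted layer reproduces the base model.
same_at_init = np.allclose(layer.forward(x), base_out)

# Stand-in for training: only B (and A) would change, never W.
layer.B = np.random.default_rng(1).normal(0.0, 0.02, size=layer.B.shape)
update_rank = np.linalg.matrix_rank(layer.delta())
```

Because the update lives entirely in the `A` and `B` factors, removing the adapter recovers the original model exactly, which is one way to read the claim that pre-existing knowledge remains intact.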
This hypothesis is supported by two independent lines of evidence. First, small LMs can store less factual knowledge than large ones but can still reason effectively — suggesting reasoning and knowledge are separable capabilities. Second, RL post-training on derivational traces selects for outputs that match reasoning-trace style while producing correct answers, but the selection pressure is on format, not on knowledge retrieval.
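To illustrate what "selection pressure on format" could look like, here is a hedged sketch of a rule-based reward. Everything in it is an assumption made for illustration: the `<think>`/`<answer>` tag names, the exact-match scoring, and the weights are hypothetical and are not taken from the Tina paper.

```python
import re

# Hypothetical trace format: reasoning in <think>, final result in <answer>.
TRACE = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion follows the assumed trace format, else 0.0."""
    return 1.0 if TRACE.match(completion.strip()) else 0.0


def accuracy_reward(completion, gold):
    """1.0 only if a well-formed answer exactly matches the reference."""
    m = TRACE.match(completion.strip())
    if m is None:
        return 0.0
    return 1.0 if m.group(2).strip() == gold.strip() else 0.0


def total_reward(completion, gold, w_format=0.5, w_accuracy=1.0):
    # Weights are illustrative assumptions, not values from the paper.
    return (w_format * format_reward(completion)
            + w_accuracy * accuracy_reward(completion, gold))


good = "<think>2 + 2 is 4</think><answer>4</answer>"
bare = "4"  # correct knowledge, but no reasoning-trace format
wrong = "<think>2 + 2 is 5</think><answer>5</answer>"
```

Under these assumptions `total_reward(good, "4")` is 1.5, `total_reward(wrong, "4")` is 0.5, and `total_reward(bare, "4")` is 0.0. The bare completion already contains the right answer yet earns nothing, so the gradient it induces pushes toward the format rather than toward new facts.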
The practical implication: if you want to add reasoning capability to a deployed model cheaply, LoRA RL post-training may be sufficient. Full-parameter post-training is appropriate when knowledge integration is needed (new domain facts, new task-specific capabilities). Format adaptation can be achieved with a small fraction of that compute.
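A rough sense of scale for the trainable footprint can be worked out for a single weight matrix. The dimensions and rank below are illustrative assumptions, not figures reported in the paper, and the ratio counts trainable parameters in one matrix, which is only a proxy for total post-training compute.

```latex
% Adapted weight: frozen W_0 plus a rank-r update (standard LoRA form)
W = W_0 + \frac{\alpha}{r} B A,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k}

% Trainable parameters: LoRA vs. full fine-tuning of the same matrix
\frac{\text{LoRA parameters}}{\text{full parameters}}
  = \frac{r\,(d + k)}{d\,k}

% Illustrative values (assumed): d = k = 1536,\; r = 32
\frac{32 \cdot (1536 + 1536)}{1536 \cdot 1536}
  = \frac{98304}{2359296}
  = \frac{1}{24} \approx 4.2\%
```

For square matrices the ratio simplifies to 2r/d, so it shrinks as the hidden size grows or the rank is lowered.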
This both offers a cheaper route to the result in "Can simple rewards alone teach complex domain reasoning?" and qualifies it: what "emerges" under RL may be mostly format discovery, not new knowledge. The emergence finding is real, but its mechanism may be simpler than it looks: the model already had the knowledge, and RL teaches it to express that knowledge in a productive output format.
Note: this is an OPEN hypothesis pending validation across a broader range of tasks and models.
Inquiring lines that use this note as a source (21)
This note is a source for these synthesized inquiries.
- Why do different AI models generate similar outputs independently?
- How should benchmarks test whether models fit algorithms or patterns?
- Do larger models develop more abstract features than smaller ones?
- How much does training data format shape what reasoning strategy emerges?
- Can smaller models actually perform well on specific downstream tasks?
- What makes training data quality more important than quantity for reasoning?
- Does trading model size for inference steps improve overall efficiency scaling?
- How does evaluation format change what we measure about model reasoning?
- When should full-parameter post-training be used instead of LoRA adaptation?
- How much does input format shape what reasoning strategy a model develops?
- How does training data format shape whether models reason in parallel or sequentially?
- How much does training data presentation format shape reasoning ability?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- Do small models show different parameter efficiency patterns than large models?
- How does training data format shape which reasoning patterns emerge in models?
- Why does training data format shape reasoning strategy more than content?
- What makes well-formatted outputs misleading as evidence of model capability?
- What is the difference between changing model outputs versus changing internal representations?
- How does RPT compare to learning when versus how to deploy reasoning?
- Does CoT reasoning actually cause the outputs that follow it?
- Can similar outputs from different systems prove they work the same way?
Related concepts in this collection (3)
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  LoRA's efficiency suggests the "emergence" may be format discovery from pre-existing knowledge, not genuine capability emergence; a qualification, not a contradiction
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  both findings point to format as the key lever: CoT Encyclopedia shows training *input* format shapes strategy; Tina shows training *output* format shapes reasoning capability
- Do reasoning traces actually cause correct answers?
  Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
  LoRA learning format efficiently is additional evidence that reasoning traces are primarily style artifacts, not deep computational structures
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Tina: Tiny Reasoning Models via LoRA
- Do Theory of Mind Benchmarks Need Explicit Human-like Reasoning in Language Models?
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Eliciting Reasoning in Language Models with Cognitive Tools
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Original note title
lora-based reasoning format adaptation achieves competitive reasoning by adapting output format rather than integrating knowledge