SYNTHESIS NOTE

Can small models reason well by just learning output format?

Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.

Synthesis note · 2026-02-22 · sourced from Reasoning Methods CoT ToT

The Tina paper trains a 1.5B-parameter model with LoRA (low-rank adaptation) during RL post-training: the base model weights stay frozen and only the LoRA modules are updated. The resulting model achieves reasoning performance competitive with, and sometimes surpassing, full-parameter RL reasoning models trained on the same base, despite using a tiny fraction of the post-training compute.

The authors explain why LoRA works so well with the Rapid Reasoning Format Adaptation Hypothesis: what RL post-training primarily teaches a small model is not new knowledge about the world, but how to organize its outputs in a reasoning-trace format. LoRA, which adds only a low-rank update to each adapted weight matrix, is sufficient to adapt the output format while the base model's pre-existing knowledge remains intact.
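To make the low-rank mechanism concrete, here is a minimal pure-Python sketch of a LoRA-adapted layer. It is not the Tina authors' code. The frozen weight W is left untouched and a rank-r product B·A is added on top, scaled by alpha / r. The matrix sizes, values, and helper names (`matvec`, `lora_forward`) are illustrative assumptions.

```python
# Minimal LoRA forward-pass sketch in pure Python (illustrative, not the paper's code).
# Assumption: all dimensions, values, and function names here are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""
    r = len(A)                           # rank = number of rows of A
    base = matvec(W, x)                  # frozen base-model path
    low_rank = matvec(B, matvec(A, x))   # adapter path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]

# Frozen 3x3 base weight (identity, for readability).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

# Rank-1 adapter: A is r x d_in, B is d_out x r.
A = [[1.0, 1.0, 1.0]]
B_zero = [[0.0], [0.0], [0.0]]       # B starts at zero in standard LoRA
B_trained = [[0.5], [0.0], [-0.5]]   # hypothetical values after training

x = [1.0, 2.0, 3.0]
y_init = lora_forward(W, A, B_zero, x, alpha=1.0)        # equals the base output
y_adapted = lora_forward(W, A, B_trained, x, alpha=1.0)  # base output plus low-rank shift
```

With B at zero the adapted layer reproduces the base model exactly, so any behavioural change after training comes only from the small A and B matrices. That is the sense in which the base model's knowledge can stay intact while the output behaviour shifts.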

This hypothesis is supported by two independent lines of evidence. First, small LMs store less factual knowledge than large ones yet can still reason effectively, which suggests that reasoning and knowledge are separable capabilities. Second, RL post-training on derivational traces selects for outputs that match reasoning-trace style while producing correct answers; the selection pressure falls on format, not on knowledge retrieval.

The practical implication: if you want to add reasoning capability to a deployed model cheaply, LoRA RL post-training may be sufficient. Full-parameter post-training is appropriate when knowledge integration is needed (new domain facts, new task-specific capabilities). Format adaptation can be achieved with a small fraction of that compute.
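A rough back-of-envelope comparison shows why the adapter route is cheap in trainable parameters. The layer shape and rank below are hypothetical and are not taken from the Tina configuration; the function names are likewise illustrative.

```python
# Back-of-envelope count of trainable parameters: full fine-tuning vs. LoRA.
# Assumption: the layer shape and rank are hypothetical, not Tina's configuration.

def full_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

d_out, d_in, rank = 2048, 2048, 16
full = full_params(d_out, d_in)        # parameters updated by full fine-tuning
lora = lora_params(d_out, d_in, rank)  # parameters updated by the adapter
fraction = lora / full                 # share of the layer that is trainable
```

For this hypothetical layer the adapter trains 65,536 parameters against 4,194,304 for the full matrix, about 1.6%. Trainable-parameter count is only a proxy for cost, since forward and backward passes still run through the frozen base model.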

This is both an optimization for, and a qualification of, the note "Can simple rewards alone teach complex domain reasoning?": what appears to "emerge" under RL may be mostly format discovery, not new knowledge. The emergence finding is real, but its mechanism may be simpler than it looks. The model already had the knowledge; RL teaches it to express that knowledge in a productive output format.

Note: this is an OPEN hypothesis pending validation on broader task and model ranges.

Inquiring lines that read this note 23

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can AI-generated outputs constitute genuine knowledge or valid claims?

Why do different AI models generate similar outputs independently?

Why do benchmark improvements fail to reflect actual reasoning quality?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Do larger models develop more abstract features than smaller ones?

Why does training format shape reasoning strategy more than domain content?

How does example difficulty affect learning efficiency in language models?

Can smaller models actually perform well on specific downstream tasks?

How do training data properties shape reasoning capability development?

What makes training data quality more important than quantity for reasoning?

Can inference-time compute substitute for scaling up model parameters?

Does trading model size for inference steps improve overall efficiency scaling?

Can ensemble evaluation methods reduce bias more than single judges?

How does evaluation format change what we measure about model reasoning?

Why does finetuning cause catastrophic forgetting of model capabilities?

When should full-parameter post-training be used instead of LoRA adaptation?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does the functional separation of knowledge and reasoning affect adaptation methods?

When does architectural design matter more than raw model capacity?

Do small models show different parameter efficiency patterns than large models?

How do training priors constrain what context information can override?

What is the difference between changing model outputs versus changing internal representations?

Does reinforcement learning teach reasoning or just when to reason?

How does RPT compare to learning when versus how to deploy reasoning?

What actually drives chain-of-thought reasoning improvements in language models?

Does CoT reasoning actually cause the outputs that follow it?

How should we design LLM systems to maintain alignment and control?

How do aligned LoRA adapters compose through parameter-space arithmetic?

What capability tradeoffs emerge when scaling model reasoning abilities?

Does adding reasoning to models degrade other capabilities like rule inference?

Related concepts in this collection 3

Can simple rewards alone teach complex domain reasoning?
Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
LoRA's efficiency suggests the "emergence" may be format discovery from pre-existing knowledge, not genuine capability emergence; a qualification, not a contradiction.

Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
Both findings point to format as the key lever: CoT Encyclopedia shows training *input* format shapes strategy; Tina shows training *output* format shapes reasoning capability.

Do reasoning traces actually cause correct answers?
Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
LoRA learning format efficiently is additional evidence that reasoning traces are primarily style artifacts, not deep computational structures.

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

lora-based reasoning format adaptation achieves competitive reasoning by adapting output format rather than integrating knowledge
