Can models improve themselves on tasks without verifiable answers?
Most self-improvement methods require verifiable correctness signals, as in math or code. Can models also improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
Self-Reasoning Language Models (SRLM) addresses a bottleneck in inference-time scaling: most self-improvement methods only work on tasks with verifiable answers (math, code) where correctness can be checked automatically. For general instruction-following — open-ended tasks without deterministic answers — self-improvement has been stuck: you can't reward correctness if you can't verify it.
SRLM's solution: create a small set (~1000 samples) of "reasoning catalyst" data — demonstrations of how to transform short, shallow reasoning chains into longer, more comprehensive ones using meta-reasoning skills. This isn't training on correct answers. It's training on the process of enriching reasoning: showing the model how to unfold the implicit reasoning steps that shorter responses skip.
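To make the shape of catalyst data concrete, here is a minimal sketch of what one record might look like and how it could be turned into a training pair. The `CatalystExample` class, its field names, the prompt template, and the sample text are all hypothetical illustrations, not the paper's actual schema or data. The point it shows is that the training target is the enriched reasoning, not a verified answer.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CatalystExample:
    """One hypothetical catalyst record: a shallow rationale and its enriched rewrite."""
    instruction: str
    short_reasoning: str
    enriched_reasoning: str


def to_training_pair(ex: CatalystExample) -> Tuple[str, str]:
    """Format a record as (prompt, target); the target is the enriched reasoning.

    The prompt template is an assumption made for illustration only.
    """
    prompt = (
        "Instruction:\n" + ex.instruction + "\n\n"
        "Initial reasoning:\n" + ex.short_reasoning + "\n\n"
        "Rewrite the reasoning so that every implicit step is spelled out."
    )
    return prompt, ex.enriched_reasoning


# Invented sample record, used only to exercise the function.
example = CatalystExample(
    instruction="Suggest a name for a bakery that sells only sourdough.",
    short_reasoning="Sourdough is tangy, so use a tangy word.",
    enriched_reasoning=(
        "The shop sells one product, so the name should signal sourdough directly. "
        "Sourdough is defined by its starter and its tang, which gives two word "
        "families to draw on. A short name is easier to remember, so combine one "
        "word from either family."
    ),
)
prompt, target = to_training_pair(example)
```

Note that nothing in the pair says whether a proposed bakery name is correct; the supervision is entirely about how the reasoning is unfolded.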
After training on this catalyst data alongside the original instruction-tuning dataset, the model acquires two capabilities: (a) the base task competence from instruction tuning, and (b) the ability to enrich its own reasoning. The model can then improve iteratively: it generates enriched reasoning candidates for the training examples, filters them with three quality selectors that make no assumption about instruction type or answer format, and is retrained on the improved data.
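The generate, filter, retrain loop described above can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the note does not say what the three selectors are or how they are combined, so the three toy selectors here are placeholders, chaining them in sequence is an assumption, and `retrain` is a hypothetical stand-in for a real fine-tuning step.

```python
from typing import Callable, Dict, List

Example = Dict[str, str]                               # {"instruction": ..., "reasoning": ...}
Enricher = Callable[[Example], List[str]]              # proposes rationales for one example
Selector = Callable[[Example, List[str]], List[str]]   # keeps or reorders candidates


def self_improve_round(data: List[Example], enrich: Enricher,
                       selectors: List[Selector]) -> List[Example]:
    """One round: enrich each example, filter candidates, keep the old rationale on failure."""
    improved = []
    for ex in data:
        candidates = enrich(ex)
        for select in selectors:          # assumption: selectors are applied in sequence
            candidates = select(ex, candidates)
        if candidates:
            improved.append({**ex, "reasoning": candidates[0]})
        else:
            improved.append(ex)
    return improved


def run(data: List[Example], enrich: Enricher, selectors: List[Selector],
        retrain: Callable[[List[Example]], Enricher], rounds: int) -> List[Example]:
    """Alternate data improvement and (hypothetical) retraining for a fixed number of rounds."""
    for _ in range(rounds):
        data = self_improve_round(data, enrich, selectors)
        enrich = retrain(data)            # stand-in: returns the enricher of the retrained model
    return data


# Toy stand-ins so the sketch runs without a model. None of these is from the paper.
def toy_enrich(ex: Example) -> List[str]:
    return [ex["reasoning"] + " Because each step follows from the last.", ""]

def drop_empty(ex: Example, cands: List[str]) -> List[str]:
    return [c for c in cands if c.strip()]

def keep_longer(ex: Example, cands: List[str]) -> List[str]:
    return [c for c in cands if len(c) > len(ex["reasoning"])]

def longest_first(ex: Example, cands: List[str]) -> List[str]:
    return sorted(cands, key=len, reverse=True)


seed = [{"instruction": "Explain tides.", "reasoning": "The moon pulls water."}]
result = run(seed, toy_enrich, [drop_empty, keep_longer, longest_first],
             retrain=lambda data: toy_enrich, rounds=2)
```

The selectors only look at the example and the candidate texts, never at a reference answer, which is what lets a loop of this shape apply to open-ended instructions.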
The key finding is stability: SRLM "not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations." This contrasts with prior methods that degrade or plateau after a few iterations: the reasoning catalyst provides a persistent enrichment signal that is not exhausted.
The 1000-sample requirement is remarkably small, though "Can a single training example unlock mathematical reasoning?" pushes this even further for narrow domains. The difference may be task breadth: 1-shot training activates math reasoning specifically, while catalyst data enables self-improvement on general instruction-following. Both connect to "Do base models already contain hidden reasoning ability?": the reasoning capability is latent in the pretrained model, and the catalyst data doesn't teach reasoning but unlocks the ability to articulate it. Similarly, "Can small models reason well by just learning output format?" shows that format is often the bottleneck, not capability.
The implication for the inference-scaling agenda: self-improvement at test time is not limited to domains with external verifiers. With the right catalyst data, models can improve on any task where more detailed reasoning would help — which is most tasks.
Inquiring lines that use this note as a source 8
This note is a source for these synthesized inquiries.
- Can external verification systems fix what self-verification cannot accomplish?
- How does the generation-verification gap limit AI self-improvement capabilities?
- How does benchmark performance measure translate to general self-modification ability?
- How does low verifiability change what we can measure in AI work?
- What alternatives exist when required knowledge is absent from training?
- Can multiple verification approaches together overcome the self-improvement ceiling?
- Does the generation-verification gap actually limit self-improvement in verifiable tasks?
- Why does self-verification fail but external process verification work?
Related concepts in this collection 9
- Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
the catalyst principle: a small training signal unlocks what's already there
- Can small models reason well by just learning output format?
Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
format adaptation as the key lever, not knowledge injection
- What limits how much models can improve themselves?
Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
SRLM's three selectors provide approximate verification for non-verifiable tasks, partially closing the gap
- Can we prune training data without hurting model performance?
This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
catalyst data represents the extreme end of data value concentration: the 1000 demonstrations are the irreducibly necessary examples that data pruning's difficulty metrics would identify as highest-value; both establish that training data value follows a power-law distribution
- Can we train better models on less data?
Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
complementary data efficiency findings: LESS identifies the right 5% of existing data via gradient similarity, catalyst data shows 1000 purpose-built demonstrations suffice for reasoning enrichment; LESS provides the principled selection mechanism that could identify which catalyst-like examples matter most
- Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
catalyst data extends the when-not-how thesis to self-improvement: 1000 demonstrations teach the model when and how to enrich its own reasoning, not reasoning capability itself; the catalyst is an activation signal for latent enrichment ability, paralleling how RL teaches deployment timing rather than reasoning execution
- Can a single training example unlock mathematical reasoning?
Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
pushes the minimal-data frontier even further: 1-shot RLVR achieves 37-point MATH500 gains from a single example, suggesting catalyst data's 1000-demonstration requirement may reflect task breadth rather than a fundamental data floor
- What makes test-time training actually work in practice?
Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
catalyst data may address TTT's first component (task-similar finetuning) more efficiently than general data selection: 1000 demonstrations of reasoning enrichment provide a compact task-similar foundation that TTT can refine per-instance, and the stability of catalyst-based self-improvement suggests the auxiliary task format (TTT's second component) could be reasoning enrichment itself
- Can careful selection of 78 demos outperform massive training datasets?
Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
extends the minimal-data principle from reasoning to agency: 78 curated trajectories activate agentic behavior just as 1000 catalyst demonstrations activate reasoning enrichment; together they establish a cross-domain pattern where capability activation requires showing the model what the capability looks like
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- Self-Improving Model Steering
- CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
- Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning
- Chain-of-thought Reasoning Is A Policy Improvement Operator
- Can Large Reasoning Models Self-Train?
- Hyperagents
Original note title
reasoning catalyst data — just 1000 demonstrations of how to enrich reasoning — enables self-improvement for general instruction tasks beyond math and code