Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require verifiable correctness signals, as in math or code. Can models improve on open-ended instruction tasks where answers can't be checked automatically? And what is the minimum training needed to unlock this?

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Self-Reasoning Language Models (SRLM) addresses a bottleneck in inference-time scaling: most self-improvement methods only work on tasks with verifiable answers (math, code) where correctness can be checked automatically. For general instruction-following — open-ended tasks without deterministic answers — self-improvement has been stuck: you can't reward correctness if you can't verify it.

SRLM's solution: create a small set (~1000 samples) of "reasoning catalyst" data — demonstrations of how to transform short, shallow reasoning chains into longer, more comprehensive ones using meta-reasoning skills. This isn't training on correct answers. It's training on the process of enriching reasoning: showing the model how to unfold the implicit reasoning steps that shorter responses skip.
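
To make the shape of catalyst data concrete, here is a minimal sketch of what one record and the combined training set might look like. The field names, the prompt template, and the helper functions are hypothetical illustrations; the note does not specify SRLM's actual data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class CatalystSample:
    """One demonstration of enriching a shallow reasoning chain (hypothetical schema)."""
    instruction: str
    short_reasoning: str      # the brief, shallow chain
    enriched_reasoning: str   # the longer chain with implicit steps unfolded


def to_training_pair(sample: CatalystSample) -> Dict[str, str]:
    """Turn a catalyst sample into a prompt/target pair.

    The target is the enriched chain, so the model is trained on the
    transformation itself rather than on a verified answer.
    The prompt wording is an assumption made for illustration.
    """
    prompt = (
        f"Instruction: {sample.instruction}\n"
        f"Draft reasoning: {sample.short_reasoning}\n"
        "Rewrite the draft with the skipped reasoning steps made explicit."
    )
    return {"prompt": prompt, "target": sample.enriched_reasoning}


def build_training_mix(
    instruction_pairs: Sequence[Dict[str, str]],
    catalyst: Sequence[CatalystSample],
) -> List[Dict[str, str]]:
    """Combine the original instruction-tuning pairs with the catalyst pairs."""
    return list(instruction_pairs) + [to_training_pair(s) for s in catalyst]
```

In this sketch the catalyst pairs are simply appended to the instruction-tuning pairs; any sampling ratio or ordering between the two sets would be a further design choice that the note does not describe.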

After training on this catalyst data alongside the original instruction-tuning dataset, the model acquires two capabilities: (a) the base task competence from instruction tuning, and (b) the ability to enrich its own reasoning. The model can then iteratively improve by generating enriched reasoning candidates for training examples, filtering with three quality selectors (no assumption about instruction type or answer format), and retraining on the improved data.
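
The generate, filter, retrain cycle described above can be sketched as a short loop. This is an illustration under stated assumptions, not SRLM's implementation: the note does not name the three selectors, so the three below are placeholder heuristics, and `train` and `make_enricher` are hypothetical callables standing in for fine-tuning and for sampling from the trained model.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    instruction: str
    response: str  # current reasoning and answer for this instruction


# A selector inspects (current example, candidate response) and keeps or rejects it.
Selector = Callable[[Example, str], bool]
Enricher = Callable[[Example, int], str]


# Placeholder selectors (assumptions, not the paper's). None of them inspects
# the instruction type or the answer format.
def longer_than_current(ex: Example, cand: str) -> bool:
    return len(cand.split()) > len(ex.response.split())


def not_repetitive(ex: Example, cand: str) -> bool:
    words = cand.split()
    return bool(words) and len(set(words)) / len(words) >= 0.5


def stays_on_topic(ex: Example, cand: str) -> bool:
    return bool(set(ex.instruction.lower().split()) & set(cand.lower().split()))


def refine_dataset(
    data: Sequence[Example],
    enrich: Enricher,
    selectors: Sequence[Selector],
    num_candidates: int = 4,
) -> List[Example]:
    """Replace each response with the first candidate that passes every selector."""
    refined = []
    for ex in data:
        candidates = [enrich(ex, i) for i in range(num_candidates)]
        kept = [c for c in candidates if all(sel(ex, c) for sel in selectors)]
        # Fall back to the current response when no candidate survives.
        refined.append(Example(ex.instruction, kept[0] if kept else ex.response))
    return refined


def self_improve(
    data: Sequence[Example],
    train: Callable[[Sequence[Example]], Any],
    make_enricher: Callable[[Any], Enricher],
    selectors: Sequence[Selector],
    iterations: int = 3,
) -> Tuple[Any, List[Example]]:
    """Train, then repeat: generate candidates, filter, retrain on the result."""
    current = list(data)
    model = train(current)
    for _ in range(iterations):
        current = refine_dataset(current, make_enricher(model), selectors)
        model = train(current)
    return model, current
```

Because a rejected example keeps its current response, a round in which no candidate passes leaves the dataset unchanged rather than degrading it. That is one simple way to get the kind of stability the note describes, though whether SRLM uses the same fallback is not stated.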

The key finding is stability: SRLM "not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations." This contrasts with prior methods that degrade or plateau after a few iterations. The reasoning catalyst appears to provide a persistent enrichment signal that is not exhausted across iterations.

The 1000-sample requirement is remarkably small, though "Can a single training example unlock mathematical reasoning?" pushes this even further for narrow domains. The difference may be task breadth: 1-shot training activates math reasoning specifically, while catalyst data enables general instruction-following self-improvement. Both connect to "Do base models already contain hidden reasoning ability?": the reasoning capability is latent in the pretrained model, and the catalyst data doesn't teach reasoning but unlocks the ability to articulate it. Similarly, "Can small models reason well by just learning output format?" shows that format is often the bottleneck, not capability.

The implication for the inference-scaling agenda: self-improvement at test time is not limited to domains with external verifiers. With the right catalyst data, models can improve on any task where more detailed reasoning would help — which is most tasks.

Inquiring lines that read this note 9

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does self-reflection enable models to reliably correct their errors?

Why does verification consistently lag behind AI generation?

Can single-axis benchmarks accurately predict agent deployment success?

How does benchmark performance measure translate to general self-modification ability?

How can models identify insufficient information and respond appropriately without guessing?

What alternatives exist when required knowledge is absent from training?

Why do self-improving systems struggle without clear external performance metrics?

Why do most self-improving systems fail when given tasks with no clear external benchmark?

Related concepts in this collection 9

Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
the catalyst principle: a small training signal unlocks what's already there
Can small models reason well by just learning output format? Does reasoning performance depend primarily on adapting how models express outputs rather than acquiring new knowledge? The Tina research tests this by applying LoRA to a 1.5B model during reasoning training.
format adaptation as the key lever, not knowledge injection
What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
SRLM's three selectors provide approximate verification for non-verifiable tasks, partially closing the gap
Can we prune training data without hurting model performance? This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
catalyst data represents the extreme end of data value concentration: the 1000 demonstrations are the irreducibly necessary examples that data pruning's difficulty metrics would identify as highest-value; both establish that training data value follows a power-law distribution
Can we train better models on less data? Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
complementary data efficiency findings: LESS identifies the right 5% of existing data via gradient similarity, catalyst data shows 1000 purpose-built demonstrations suffice for reasoning enrichment; LESS provides the principled selection mechanism that could identify which catalyst-like examples matter most
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
catalyst data extends the when-not-how thesis to self-improvement: 1000 demonstrations teach the model when and how to enrich its own reasoning, not reasoning capability itself; the catalyst is an activation signal for latent enrichment ability, paralleling how RL teaches deployment timing rather than reasoning execution
Can a single training example unlock mathematical reasoning? Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
pushes the minimal-data frontier even further: 1-shot RLVR achieves 37-point MATH500 gains from a single example, suggesting catalyst data's 1000-demonstration requirement may reflect task breadth rather than a fundamental data floor
What makes test-time training actually work in practice? Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
catalyst data may address TTT's first component (task-similar finetuning) more efficiently than general data selection: 1000 demonstrations of reasoning enrichment provide a compact task-similar foundation that TTT can refine per-instance, and the stability of catalyst-based self-improvement suggests the auxiliary task format (TTT's second component) could be reasoning enrichment itself
Can careful selection of 78 demos outperform massive training datasets? Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
extends the minimal-data principle from reasoning to agency: 78 curated trajectories activate agentic behavior just as 1000 catalyst demonstrations activate reasoning enrichment; together they establish a cross-domain pattern where capability activation requires showing the model what the capability looks like
