SYNTHESIS NOTE

Can models improve themselves on tasks without verifiable answers?

Most self-improvement methods require verifiable correctness signals, as in math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Self-Reasoning Language Models (SRLM) addresses a bottleneck in inference-time scaling: most self-improvement methods only work on tasks with verifiable answers (math, code) where correctness can be checked automatically. For general instruction-following — open-ended tasks without deterministic answers — self-improvement has been stuck: you can't reward correctness if you can't verify it.

SRLM's solution: create a small set (~1000 samples) of "reasoning catalyst" data — demonstrations of how to transform short, shallow reasoning chains into longer, more comprehensive ones using meta-reasoning skills. This isn't training on correct answers. It's training on the process of enriching reasoning: showing the model how to unfold the implicit reasoning steps that shorter responses skip.
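As a rough illustration of what one catalyst demonstration might look like as data, the sketch below pairs a short reasoning chain with its enriched version and turns the pair into a training example. The field names, the `CatalystRecord` type, and the prompt wording are assumptions made for illustration; the note does not specify SRLM's actual data format.

```python
from dataclasses import dataclass


@dataclass
class CatalystRecord:
    """One hypothetical catalyst demonstration (field names are assumptions)."""
    instruction: str          # the original task prompt
    short_reasoning: str      # the shallow reasoning chain to be enriched
    enriched_reasoning: str   # the longer chain with implicit steps unfolded


def to_training_pair(rec: CatalystRecord) -> dict:
    """Build a prompt/target pair that teaches the enrichment process itself.

    The target is the enriched reasoning, not a verified answer.
    The prompt template is illustrative only.
    """
    prompt = (
        f"Instruction:\n{rec.instruction}\n\n"
        f"Draft reasoning:\n{rec.short_reasoning}\n\n"
        "Rewrite the draft reasoning so that every implicit step is spelled out."
    )
    return {"prompt": prompt, "target": rec.enriched_reasoning}
```

The point of the shape is that supervision comes from the transformation (short to enriched), so no answer checker is needed to build the set.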

After training on this catalyst data alongside the original instruction-tuning dataset, the model acquires two capabilities: (a) the base task competence from instruction tuning, and (b) the ability to enrich its own reasoning. The model can then improve iteratively: it generates enriched reasoning candidates for the training examples, filters them with three quality selectors that make no assumption about instruction type or answer format, and is retrained on the improved data.
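The generate, filter, retrain loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SRLM's implementation: the note does not name the three selectors, so `longer_than_original`, `not_repetitive`, and `within_length_cap` are hypothetical stand-ins, as are the `Example` type and the `enrich` and `retrain` callables that represent the model.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class Example:
    instruction: str
    reasoning: str  # current reasoning chain for this instruction


# A selector compares (original, candidate) and returns True to accept the candidate.
Selector = Callable[[Example, Example], bool]


def longer_than_original(orig: Example, cand: Example) -> bool:
    # Hypothetical selector: enrichment should add reasoning, not remove it.
    return len(cand.reasoning.split()) > len(orig.reasoning.split())


def not_repetitive(orig: Example, cand: Example) -> bool:
    # Hypothetical selector: reject candidates padded with repeated words.
    words = cand.reasoning.split()
    return len(set(words)) / max(len(words), 1) > 0.5


def within_length_cap(orig: Example, cand: Example, cap: int = 2048) -> bool:
    # Hypothetical selector: reject runaway generations.
    return len(cand.reasoning.split()) <= cap


SELECTORS: List[Selector] = [longer_than_original, not_repetitive, within_length_cap]


def self_improve_round(
    data: List[Example],
    enrich: Callable[[Example], str],
    selectors: List[Selector],
    n_candidates: int = 4,
) -> List[Example]:
    """Replace each example with the first candidate that passes every selector.

    Examples with no passing candidate are kept unchanged.
    """
    improved = []
    for ex in data:
        chosen = ex
        for _ in range(n_candidates):
            cand = replace(ex, reasoning=enrich(ex))
            if all(sel(ex, cand) for sel in selectors):
                chosen = cand
                break
        improved.append(chosen)
    return improved


def self_improve(
    data: List[Example],
    enrich: Callable[[Example], str],
    retrain: Callable[[List[Example]], Callable[[Example], str]],
    iterations: int = 3,
) -> List[Example]:
    """Alternate enrichment and retraining; `retrain` returns the updated enrich function."""
    for _ in range(iterations):
        data = self_improve_round(data, enrich, SELECTORS)
        enrich = retrain(data)
    return data
```

None of the selectors inspects the answer itself, which mirrors the note's point that filtering assumes nothing about instruction type or answer format.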

The key finding is stability: SRLM "not only enhances the model's initial performance but also ensures more stable and consistent improvements in subsequent iterations." This contrasts with prior methods that degrade or plateau after a few iterations. The reasoning catalyst provides a persistent enrichment signal that is not exhausted as iterations proceed.

The 1000-sample requirement is remarkably small, though "Can a single training example unlock mathematical reasoning?" pushes this even further for narrow domains. The difference may be task breadth: 1-shot training activates math reasoning specifically, while catalyst data enables general instruction-following self-improvement. Both connect to "Do base models already contain hidden reasoning ability?": the reasoning capability is latent in the pretrained model, and the catalyst data doesn't teach reasoning but unlocks the ability to articulate it. Similarly, "Can small models reason well by just learning output format?" shows that format is often the bottleneck, not capability.

The implication for the inference-scaling agenda: self-improvement at test time is not limited to domains with external verifiers. With the right catalyst data, models can improve on any task where more detailed reasoning would help — which is most tasks.

Original note title

reasoning catalyst data — just 1000 demonstrations of how to enrich reasoning — enables self-improvement for general instruction tasks beyond math and code