SYNTHESIS NOTE

Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This note explores whether intermediate-step alignment can overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

Small open-source models hit a wall on hard multi-step reasoning problems. RLVR (Reinforcement Learning with Verifiable Rewards) fails when the model's success rate is effectively zero: no rollout produces the correct answer, so outcome-only supervision provides no positive signal. SFT (Supervised Fine-Tuning) overfits long demonstrations through rigid token-by-token imitation, particularly on small models where complex teacher traces exceed the student's representational capacity. Both methods fail in the same regime: small model, hard problem, no path to correctness through their standard supervision.
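The vanishing signal under outcome-only rewards can be illustrated with a toy calculation. The sketch below assumes a policy-gradient setup in which each rollout's advantage is its reward minus the mean reward of its group. The note does not specify which RL algorithm is used, and the function name `mean_baseline_advantages` is hypothetical.

```python
from typing import List


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each rollout relative to its group's mean reward.

    Assumption: a simple mean baseline, used here only for illustration.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Four rollouts on a hard problem, none correct: every binary reward is 0,
# so every advantage is 0 and the policy gradient carries no signal.
all_failed = mean_baseline_advantages([0.0, 0.0, 0.0, 0.0])

# If a single rollout succeeds, the group becomes informative.
one_success = mean_baseline_advantages([0.0, 0.0, 1.0, 0.0])

print(all_failed)
print(one_success)
```

Under this assumption, a success rate of exactly zero gives every rollout the same advantage, which is the regime the paragraph above describes.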

Supervised Reinforcement Learning (SRL) fills the gap. The framework reformulates problem-solving as generating a sequence of logical actions, with the model trained to produce an internal reasoning monologue before committing to each action. Rewards come not from final-answer correctness but from similarity between the model's actions and expert actions extracted from an SFT dataset, computed step-wise as the rollout proceeds.
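One way to picture the step-wise setup is to split each expert demonstration into per-step training instances and to separate the model's monologue from the action it commits to. The sketch below is illustrative only. The note does not say how contexts are built or how the monologue is delimited, so the choice to condition each step on the preceding expert actions, the `<think>...</think>` delimiter, and the names `StepExample`, `build_step_examples`, and `extract_action` are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class StepExample:
    context: str        # problem plus the expert actions that precede this step
    expert_action: str  # the expert action the model's action is compared to


def build_step_examples(problem: str, expert_actions: List[str]) -> List[StepExample]:
    """Turn one expert demonstration into one training instance per step."""
    examples = []
    for k, action in enumerate(expert_actions):
        prior = "\n".join(expert_actions[:k])
        context = problem if not prior else f"{problem}\n{prior}"
        examples.append(StepExample(context=context, expert_action=action))
    return examples


# Assumed delimiter for the internal monologue; not taken from the note.
_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_action(model_output: str) -> str:
    """Drop the monologue and keep only the committed action."""
    return _THINK.sub("", model_output).strip()


examples = build_step_examples("Solve x + 3 = 7", ["x = 7 - 3", "x = 4"])
action = extract_action("<think>Move 3 to the right side.</think> x = 7 - 3")
```

In this sketch only the extracted action would be compared to `expert_action`, so the monologue itself is left unconstrained by the reward.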

The reward structure is the key shift. Outcome rewards are sparse and binary — correct or not. Step-wise similarity rewards are dense and smooth — partial credit for partial alignment with expert steps. The model receives useful signal even on problems where it never reaches the correct answer, because the gradient flows from incremental alignment with the demonstrated reasoning path rather than from final-answer matching.
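The contrast between the two reward types can be sketched in a few lines. The note does not name the similarity metric, so this example assumes the character-level ratio from `difflib.SequenceMatcher` in Python's standard library. The function names and the toy algebra problem are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Sparse, binary reward: 1.0 only if the final answer matches."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def step_reward(model_action: str, expert_action: str) -> float:
    """Dense reward in [0, 1]: similarity between one model action and the
    corresponding expert action (assumed metric: SequenceMatcher ratio)."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_actions: List[str], expert_actions: List[str]) -> List[float]:
    """One reward per aligned step of the rollout."""
    return [step_reward(m, e) for m, e in zip(model_actions, expert_actions)]


expert = ["x + 3 = 7", "x = 7 - 3", "x = 4"]
model = ["x + 3 = 7", "x = 7 - 3", "x = 5"]  # correct path, wrong final step

sparse = outcome_reward(model[-1], expert[-1])
dense = stepwise_rewards(model, expert)

print(sparse)
print(dense)
```

Here the rollout earns nothing from the outcome reward, yet the first two steps receive full step-wise credit and the last step receives partial credit.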

This also addresses the SFT failure mode. SFT forces token-by-token imitation, which makes long expert traces brittle teaching examples for small models — one wrong predicted token derails the imitation. SRL operates at the action level, decomposing expert demonstrations into manageable steps. The model can be wrong about specific tokens while still receiving credit for action-level alignment.
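A toy comparison shows why token-level matching is brittle where action-level similarity is not. This is only a proxy: position-wise token agreement stands in for token-by-token imitation and is not the SFT cross-entropy loss, and `difflib.SequenceMatcher` is an assumed similarity metric that the note does not specify. Both function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def positional_agreement(model_tokens: List[str], expert_tokens: List[str]) -> float:
    """Fraction of expert positions where the model emitted the same token."""
    matches = sum(m == e for m, e in zip(model_tokens, expert_tokens))
    return matches / len(expert_tokens)


def action_similarity(model_tokens: List[str], expert_tokens: List[str]) -> float:
    """Order-aware similarity over the whole action, tolerant of insertions."""
    return SequenceMatcher(None, model_tokens, expert_tokens).ratio()


expert_tokens = "subtract 3 from both sides".split()
model_tokens = "then subtract 3 from both sides".split()  # one extra leading token

token_level = positional_agreement(model_tokens, expert_tokens)
action_level = action_similarity(model_tokens, expert_tokens)

print(token_level)
print(action_level)
```

A single extra token shifts every later position, so the position-wise score drops to zero while the action-level score stays high.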

The empirical result: SRL enables small models to learn problems that neither SFT nor RLVR could teach them. The method is most powerful as a curriculum component: initializing with SRL and then refining with RLVR outperforms either method alone, with SRL building the foundation that RLVR then sharpens.

Inquiring lines that read this note 25

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can process reward models supervise complex reasoning traces?

What structural advantages do diffusion language models offer over autoregressive methods?

Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?

Does alignment training create blind spots in detecting genuine safety threats?

Does correct model behavior guarantee internal alignment of learned objectives?

What properties determine whether reward signals teach genuine reasoning?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does test-time aggregation affect reasoning correctness and reliability?

When do aggregated imperfect demonstrations fail to outperform the best expert?

Why do reward structures fail to shape long-term agent learning?

Why do next-turn reward objectives fail to encourage multi-turn goal progress?

How do self-generated feedback mechanisms enable effective model learning?

What constrains reinforcement learning's ability to expand model reasoning?

When does outcome reward signal become informative during model training?

How does example difficulty affect learning efficiency in language models?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why does SFT fail when expert demonstrations are too long for small models?

Do harness improvements transfer across model scales or memorize shortcuts?

Can smaller models produce skill updates as useful as frontier model updates?

How can AI agents autonomously learn and transfer skills across tasks?

Why does delegation training help models that work alone?

Can language model RL training avoid reward hacking and misalignment?

Why do dense rewards plus hard constraints outperform single fixed rewards?

Related concepts in this collection 3

Does sequencing imitation then exploration training improve reasoning? Can combining Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities neither method achieves independently.
same paper, the curriculum combination
Can curriculum learning approximate expensive process supervision? Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?
adjacent: another method to bridge SFT and RLVR
Why does teacher-student information asymmetry enable learning signals? What role does privileged answer access play in making social meta-learning training work? Without asymmetric information, can a conversation between teacher and student function as pedagogy or only as parallel speculation?
adjacent: another method using expert/privileged information for small-model training
