Can step-wise expert rewards help small models learn hard reasoning?
When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.
Small open-source models hit a wall on hard multi-step reasoning problems. RLVR (Reinforcement Learning with Verifiable Rewards) fails when the model's success rate is effectively zero: no rollout produces the correct answer, so outcome-only supervision provides no positive signal. SFT (Supervised Fine-Tuning) overfits long demonstrations through rigid token-by-token imitation, particularly on small models where complex teacher traces exceed the student's representational capacity. Both methods fail in the same regime: small model, hard problem, and no path to correctness through their standard supervision.
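To make the RLVR failure concrete, here is a minimal sketch of what happens to the learning signal when every rollout fails. It assumes a group-relative advantage normalization of the kind used in GRPO-style training; the note does not say which estimator is used, and the function name `group_advantages` and the `eps` parameter are illustrative.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps).

    Assumption: GRPO-style normalization within a group of rollouts
    sampled for the same problem.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hard problem, small model: every rollout gets outcome reward 0.
all_fail = [0.0, 0.0, 0.0, 0.0]
# Easier problem: one rollout out of four reaches the correct answer.
one_success = [0.0, 0.0, 1.0, 0.0]

print(group_advantages(all_fail))     # every advantage is zero
print(group_advantages(one_success))  # the successful rollout stands out
```

When all rewards in the group are identical, every advantage is zero and the policy update carries no information about which rollout was better.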
Supervised Reinforcement Learning (SRL) fills the gap. The framework reformulates problem-solving as generating a sequence of logical actions, with the model trained to produce an internal reasoning monologue before committing to each action. Rewards come not from final-answer correctness but from similarity between the model's actions and expert actions extracted from an SFT dataset, computed step-wise as the rollout proceeds.
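One way to picture the step-wise setup is to turn each expert demonstration into one training instance per action. The sketch below assumes the expert solution has already been split into a list of step strings; the `StepExample` type and `build_step_examples` function are hypothetical names, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StepExample:
    context: str        # problem plus the expert steps taken so far
    target_action: str  # the expert's next action, used only for the reward

def build_step_examples(problem: str, expert_steps: List[str]) -> List[StepExample]:
    """Create one instance per expert step (assumed decomposition scheme)."""
    examples = []
    for k, step in enumerate(expert_steps):
        prefix = "\n".join(expert_steps[:k])
        context = problem if not prefix else f"{problem}\n{prefix}"
        examples.append(StepExample(context=context, target_action=step))
    return examples

steps = ["Subtract 3 from both sides: 2x = 8", "Divide by 2: x = 4", "Answer: 4"]
for ex in build_step_examples("Solve 2x + 3 = 11.", steps):
    print(repr(ex.context), "->", repr(ex.target_action))
```

Under this scheme a demonstration with N steps yields N shorter prediction problems, each of which can be rewarded on its own.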
The reward structure is the key shift. Outcome rewards are sparse and binary: the answer is correct or it is not. Step-wise similarity rewards are dense and smooth, giving partial credit for partial alignment with expert steps. The model receives useful signal even on problems where it never reaches the correct answer, because the learning signal comes from incremental alignment with the demonstrated reasoning path rather than from final-answer matching.
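A dense similarity reward of this kind can be sketched in a few lines. The note does not specify the similarity metric, so this example assumes character-level sequence matching with Python's standard `difflib.SequenceMatcher`. The `<think>...</think>` tag format for the monologue and the function names are also assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher

def extract_action(model_output: str) -> str:
    """Drop an internal monologue wrapped in <think> tags (assumed format)."""
    return re.sub(r"<think>.*?</think>", "", model_output, flags=re.DOTALL).strip()

def step_reward(model_output: str, expert_action: str) -> float:
    """Similarity in [0, 1] between the model's action and the expert's action.

    Only the action is scored; the monologue is free-form and unrewarded.
    """
    action = extract_action(model_output)
    return SequenceMatcher(None, action, expert_action).ratio()

expert = "x = 4"
print(step_reward("<think>divide both sides by 2</think> x = 4", expert))  # exact match
print(step_reward("<think>guessing</think> x = 5", expert))                # partial credit
print(step_reward("", expert))                                             # no action produced
```

Unlike a binary outcome check, a near-miss action still earns most of the reward, so rollouts within a group can be ranked even when none of them would solve the full problem.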
This also addresses the SFT failure mode. SFT forces token-by-token imitation, which makes long expert traces brittle teaching examples for small models: one wrong predicted token can derail the imitation. SRL operates at the action level, decomposing expert demonstrations into manageable steps. The model can be wrong about specific tokens while still receiving credit for action-level alignment.
The empirical result: SRL enables small models to learn problems previously unlearnable by SFT or RLVR. The method is most powerful as a curriculum component. Initializing with SRL and then refining with RLVR outperforms either method alone, with SRL building the foundation that RLVR then sharpens.
Inquiring lines that use this note as a source (22)
- How do outcome and process rewards differ in their treatment of intermediate steps?
- Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?
- Does correct model behavior guarantee internal alignment of learned objectives?
- Can multi-turn rewards fix models that lose track midway?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- When do aggregated imperfect demonstrations fail to outperform the best expert?
- What information do numerical rewards fail to provide for reasoning tasks?
- Why do next-turn reward objectives fail to encourage multi-turn goal progress?
- Why do models follow a two-phase pattern of procedural then strategic learning?
- Why does imitation learning alone plateau without outcome-based refinement?
- When does outcome reward signal become informative during model training?
- Why does belief-shift reward enable smaller models to match larger baselines?
- How does a challenger's escalating difficulty function as curriculum?
- Why do medium-difficulty problems produce more stable learning gains?
- Why does step-level expert alignment work when outcome-only RL fails?
- What does process supervision reveal about step-level reasoning versus outcome rewards?
- Does the productive difficulty band ever stabilize during training?
- How does difficulty-adaptive curriculum learning change which samples get selected during training?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- Why does SFT fail when expert demonstrations are too long for small models?
- What makes step-wise rewards denser than final-answer correctness signals?
- Can smaller models produce skill updates as useful as frontier model updates?
Related concepts in this collection (3)
- Does sequencing imitation then exploration training improve reasoning?
  Can combining Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities neither method achieves independently.
  same paper, the curriculum combination
- Can curriculum learning approximate expensive process supervision?
  Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?
  adjacent: another method to bridge SFT and RLVR
- Why does teacher-student information asymmetry enable learning signals?
  What role does privileged answer access play in making social meta-learning training work? Without asymmetric information, can a conversation between teacher and student function as pedagogy or only as parallel speculation?
  adjacent: another method using expert/privileged information for small-model training
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Understanding and Mitigating Premature Confidence for Better LLM Reasoning
- Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
- StepWiser: Stepwise Generative Judges for Wiser Reasoning
- RM-R1: Reward Modeling as Reasoning
- Tina: Tiny Reasoning Models via LoRA
Original note title
supervised RL provides step-wise expert-similarity rewards that yield learning signal even when all rollouts fail — bridges the SFT-RLVR gap for small models on hard reasoning