INQUIRING LINE

How does scaffolding unstable mechanics improve reinforcement learning for search?

This reads the question as asking how external supports — diverse demonstrations, structured feedback, differential trajectory handling, memory — can stabilize the parts of reinforcement learning that break down when you train models to search.


This line explores why RL for search agents has fragile, unstable mechanics in the first place, and what scaffolds the corpus offers to shore them up. The core instability is the same one that plagues RL for reasoning: as training rewards a narrow band of winning strategies, the policy collapses toward them and exploration diversity gets squeezed out (Does reinforcement learning squeeze exploration diversity in search agents?). A search agent that has lost its breadth stops probing alternative paths, which is exactly the behavior search depends on. The first scaffold, then, is supervised fine-tuning on diverse demonstrations, which preserves the exploration breadth that pure reward maximization erodes.

A second instability is that the reward signal itself is too thin to teach what went wrong. Binary correctness rewards quietly degrade calibration, pushing models toward confident guessing because a confident wrong answer isn't penalized any more than a hesitant one. Adding a proper scoring rule such as the Brier score as a second reward term repairs this without trading off accuracy (Does binary reward training hurt model calibration?). And when numerical rewards plateau, the missing ingredient is information about *why* a search failed; natural-language critiques let a stuck model recover solutions that no scalar reward could coax out (Can natural language feedback overcome numerical reward plateaus?). Both are scaffolds in the literal sense: richer signals propping up a reward function that's structurally too crude.
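The calibration point can be made concrete with a small sketch. It assumes the model reports a confidence in [0, 1] alongside its answer, and it combines the two reward terms with equal weight; both the function names and the weighting are illustrative assumptions, not something the notes specify.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Plain correctness reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumption: the two terms are weighted equally.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


# A wrong answer held confidently vs. hesitantly.
confident_wrong = (False, 0.9)
hesitant_wrong = (False, 0.1)

# The binary reward cannot tell the two apart.
same_under_binary = binary_reward(*confident_wrong) == binary_reward(*hesitant_wrong)

# The Brier term makes the confident mistake strictly worse.
worse_when_confident = calibrated_reward(*confident_wrong) < calibrated_reward(*hesitant_wrong)
```

Under the binary reward both wrong answers score the same, so nothing discourages overconfidence. With the Brier term the confident mistake is penalized more, while a correct answer stated with full confidence still earns the maximum reward.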

The most interesting scaffolds operate on the trajectories themselves. Rather than feeding the policy only clean, shortcut solutions, journey learning trains on the whole messy exploration (failed attempts, backtracking, self-correction), which teaches a more robust search process instead of memorized answers (Can models learn better by training on messy exploration paths?). SkillRL pushes this further by processing successes and failures *asymmetrically*: successful episodes become concrete demonstrations and failures become abstracted lessons, mirroring how human experts actually consolidate experience while using far less context (Should successful and failed episodes be processed differently?). Tree search adds structure from the other direction: MCTS naturally ranks paths by success, manufacturing the dense process-level reward signals that normally require expensive human annotation (Can tree search replace human feedback in LLM training?).
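To illustrate how a tree turns sparse outcomes into step-level signal, here is a minimal sketch of the value backup alone. It is not MCTS (there is no selection or expansion policy) and not AlphaLLM's method; it only scores each shared prefix by the success rate of the rollouts passing through it. The rollout data and the names `prefix_values` and `step_rewards` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rollout = Tuple[List[str], bool]  # (sequence of search steps, final success)
Prefix = Tuple[str, ...]


def prefix_values(rollouts: List[Rollout]) -> Dict[Prefix, float]:
    """Score every step prefix by the fraction of rollouts through it that succeed."""
    visits: Dict[Prefix, int] = defaultdict(int)
    wins: Dict[Prefix, int] = defaultdict(int)
    for steps, success in rollouts:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            visits[prefix] += 1
            wins[prefix] += int(success)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}


def step_rewards(steps: List[str], values: Dict[Prefix, float]) -> List[float]:
    """Read off one value per step, giving a dense signal for a single trajectory."""
    return [values[tuple(steps[:i])] for i in range(1, len(steps) + 1)]


# Hypothetical rollouts: only the final outcome is labeled.
rollouts: List[Rollout] = [
    (["query A", "open doc 1"], True),
    (["query A", "open doc 2"], False),
    (["query B", "open doc 3"], False),
]

values = prefix_values(rollouts)
dense = step_rewards(["query A", "open doc 1"], values)
```

Here the first step "query A" receives partial credit because only one of its two continuations succeeds, while "query B" receives none. That per-step ranking falls out of the tree structure without any step-level human labels.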

There's a reason scaffolding helps so disproportionately, and it's visible in how RL actually changes a model. RL touches only 5–30% of parameters, and it does so in structured, reproducible subnetworks rather than rewriting the model wholesale (Does reinforcement learning update only a small fraction of parameters?). It also largely sharpens sampling toward solutions the base model could already reach rather than expanding the frontier (Does RLVR actually expand what models can reason about?). If RL mostly reweights existing capability, then the scaffold isn't a side dish; it's where genuinely new search behavior has to come from. Training also moves through phases: execution correctness dominates early, then strategic planning becomes the bottleneck (Does RL training follow a predictable two-phase learning sequence?), which tells you *when* each kind of scaffold pays off.
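The sharpening claim rests on pass@k comparisons, which can be sketched with the standard unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. The per-problem counts below are invented purely to show the crossover pattern; they are not results from any of the cited notes.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct, given c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n samples, c correct)."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Hypothetical counts for two problems, 100 samples each.
# The base model is unreliable but occasionally solves both problems.
base_counts = [(100, 20), (100, 2)]
# The RL-tuned model is sharp on one problem and never solves the other.
tuned_counts = [(100, 90), (100, 0)]

tuned_wins_at_1 = mean_pass_at_k(tuned_counts, 1) > mean_pass_at_k(base_counts, 1)
base_wins_at_64 = mean_pass_at_k(base_counts, 64) > mean_pass_at_k(tuned_counts, 64)
```

With these made-up numbers the tuned model leads at k = 1 and the base model leads at k = 64, which is the signature of reweighting existing solutions rather than reaching new ones.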

The lateral surprise is that the strongest stabilizer may be to move learning out of the weights entirely. Externalized skill libraries let agents compose new search skills from old ones without the catastrophic forgetting that weight updates cause (Can agents learn new skills without forgetting old ones?), and memory-based online RL drives continual adaptation through case, subtask, and tool memory with no parameter updates at all, reaching strong results on hard agentic benchmarks (Can agents learn continuously from experience without updating weights?). Read together, the corpus suggests the 'unstable mechanics' of RL for search are best handled not by tuning RL harder but by surrounding it with scaffolds (diverse demos, richer feedback, asymmetric trajectory handling, and external memory) that carry the parts RL does badly.
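A rough sketch of what learning outside the weights can look like: a library that stores skills as callables, indexes them by description, and retrieves the closest match for a new task. This is not VOYAGER's or AgentFly's implementation. The bag-of-words `embed` function stands in for a learned embedding model, and every name here is hypothetical.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

Skill = Callable[[str], str]


def embed(text: str) -> Counter:
    # Stand-in embedding (assumption): word counts instead of a learned model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Stores skills outside the model; adding one never overwrites another."""

    def __init__(self) -> None:
        self._skills: List[Tuple[str, Counter, Skill]] = []

    def add(self, description: str, fn: Skill) -> None:
        self._skills.append((description, embed(description), fn))

    def retrieve(self, task: str, top_k: int = 1) -> List[Tuple[str, Skill]]:
        query = embed(task)
        ranked = sorted(self._skills, key=lambda s: cosine(query, s[1]), reverse=True)
        return [(description, fn) for description, _, fn in ranked[:top_k]]


def web_search(query: str) -> str:
    return f"results({query})"  # placeholder for a real search tool


def read_page(url: str) -> str:
    return f"text({url})"  # placeholder for a real page reader


library = SkillLibrary()
library.add("search the web for a query", web_search)
library.add("open a page and extract text", read_page)
# A new skill composed from the two stored ones, added without touching either.
library.add(
    "search the web then extract text from the top page",
    lambda query: read_page(web_search(query)),
)

best_description, best_skill = library.retrieve("search the web then extract text")[0]
```

Because each skill is an entry in a list rather than a change to shared parameters, adding the composed skill leaves the two original skills exactly as they were, which is the property the notes credit for avoiding forgetting.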


Sources (11 notes)

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can models learn better by training on messy exploration paths?

Research shows that training on messy trajectories—failed attempts, self-correction, and backtracking—teaches more robust reasoning than training only on shortcut solutions. This approach models o1-style deep reasoning as search internalization rather than solution memorization.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
