How should humans specify deterministic abstractions of RL problems?
This explores where and how humans should hand-build the fixed structure around an RL problem — the reward definition, the control flow, the skill vocabulary, the exploration scaffolding — rather than leaving everything to be learned from scratch.
The corpus suggests the answer is not one abstraction but several layers, each of which can be specified by hand, and that getting each layer right matters more than the choice of learning algorithm.
Start with the reward, the most consequential abstraction a human writes. A naive deterministic spec ('correct = 1, wrong = 0') quietly teaches the wrong thing: binary correctness rewards incentivize confident guessing because they never punish confident errors. The fix is to add a second hand-specified term, the Brier score, so that accuracy and calibration are optimized jointly (Does binary reward training hurt model calibration?). The lesson generalizes: the human's job isn't to write *a* reward but to write one whose optimum is the behavior you actually want. When a clean deterministic verifier is hard to write, the corpus shows two escape hatches: semi-formal reasoning templates that verify patch equivalence at 93% accuracy without running anything (Can structured reasoning replace code execution for RL rewards?), and adversarial critics that replace a domain-specific verifier with a learned discriminator, letting you train reasoning even where no crisp success test exists (Can adversarial critics replace task-specific verifiers for reasoning?).
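The difference between the two reward specs can be sketched in a few lines. This is a minimal illustration, not the implementation from the cited note: the function names are hypothetical, and the equal weighting of the correctness term and the Brier penalty is an assumption of the sketch.

```python
def binary_reward(correct: bool) -> float:
    """Naive spec: ignores confidence, so a confident error costs nothing extra."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence.

    `confidence` is the model's self-reported probability of being correct.
    The 1:1 weighting of the two terms is an assumption of this sketch.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))
```

Under this spec a confident wrong answer scores -1 while an unconfident wrong answer scores 0, and the expected reward is maximized by reporting a confidence equal to the true probability of being correct. The binary spec has no such property, because confidence never enters it.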
The second layer is control flow. LLM Programs make the case that humans should specify the *algorithm* (the explicit steps, state, and information hiding) and let the model fill only the step-specific slots, turning an opaque task into modular, debuggable sub-tasks (Can algorithms control LLM reasoning better than LLMs alone?). This is the purest form of 'deterministic abstraction': the human writes the scaffold, the policy learns inside it. RLAD pushes the same idea into exploration itself. Instead of extending ever-deeper solution chains, it jointly trains an abstraction generator and a solution generator, and spending test-time compute on diverse *abstractions* gives a structured breadth-first exploration that outperforms sampling more solutions at large compute budgets (Can abstractions guide exploration better than depth alone?).
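The scaffold idea can be sketched as ordinary code that owns the loop and the state while the model only fills slots. This is an illustrative sketch under stated assumptions: the `LLM` type stands in for any prompt-to-text callable, and the three-step filter/extract/answer pipeline and its prompts are hypothetical, not taken from the cited note.

```python
from typing import Callable, List

# Hypothetical model interface: prompt in, completion out.
LLM = Callable[[str], str]


def answer_with_program(question: str, documents: List[str], llm: LLM) -> str:
    """Human-written control flow; each model call sees only step-specific context."""
    # Step 1: relevance filter. Each call sees exactly one document.
    relevant = []
    for doc in documents:
        verdict = llm(
            f"Question: {question}\nDocument: {doc}\n"
            "Is this document relevant? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            relevant.append(doc)

    # Step 2: extraction. Each call sees one relevant document.
    notes = [
        llm(f"Question: {question}\nDocument: {doc}\nExtract the one helpful fact.")
        for doc in relevant
    ]

    # Step 3: final answer. The call sees only the notes, never the raw documents.
    joined = "\n".join(notes)
    return llm(f"Question: {question}\nNotes:\n{joined}\nAnswer:")
```

Because the steps are explicit, each one can be logged, tested with a stub in place of the model, or swapped out without touching the others.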
A third, underrated layer is what gets abstracted *out* into memory or a skill library rather than baked into weights. VOYAGER stores executable skills in an embedding-indexed library and composes complex behavior from simpler pieces. That is itself a human-chosen abstraction boundary: skills are the named, reusable units (Can agents learn new skills without forgetting old ones?). AgentFly goes further, formalizing the whole problem as a memory-augmented MDP in which credit assignment and policy improvement happen through case, subtask, and tool memory operations while the network weights never change (Can agents learn continuously from experience without updating weights?). Choosing *which* MDP you're solving (what counts as state, what counts as an action) is the most deterministic abstraction of all, and these papers show it can live outside the parameters entirely.
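A skill library of this kind can be sketched as a plain data structure. This is a toy illustration, not VOYAGER's implementation: the class and skill names are hypothetical, and word overlap stands in for the embedding index described in the source note.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A skill maps one state dict to a new state dict (assumption of this sketch).
Skill = Callable[[dict], dict]


@dataclass
class SkillLibrary:
    skills: Dict[str, Skill] = field(default_factory=dict)
    descriptions: Dict[str, str] = field(default_factory=dict)

    def add(self, name: str, description: str, fn: Skill) -> None:
        self.skills[name] = fn
        self.descriptions[name] = description

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Rank skills by word overlap with the query (stand-in for embeddings)."""
        words = set(query.lower().split())

        def score(name: str) -> int:
            return len(words & set(self.descriptions[name].lower().split()))

        ranked = sorted(self.skills, key=lambda n: (-score(n), n))
        return ranked[:k]

    def compose(self, name: str, description: str, parts: List[str]) -> None:
        """Register a new skill that runs existing skills in sequence."""
        fns = [self.skills[p] for p in parts]  # KeyError if a part is unknown

        def composed(state: dict) -> dict:
            for fn in fns:
                state = fn(state)
            return state

        self.add(name, description, composed)


def mine_wood(state: dict) -> dict:
    return {**state, "wood": state.get("wood", 0) + 1}


def craft_planks(state: dict) -> dict:
    if state.get("wood", 0) < 1:
        raise ValueError("need at least one wood")
    return {**state, "wood": state["wood"] - 1, "planks": state.get("planks", 0) + 4}
```

New behavior is added by registering or composing entries, so nothing learned earlier is overwritten. That is the property the source note credits for avoiding forgetting.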
The quiet thread across all of this: RL mostly doesn't create new capability, it deploys latent capability. Base models already contain reasoning strategies, and RL learns *when* to fire them, not *how* (Does RL post-training create reasoning or just deploy it?). The updates themselves touch only a sparse but structured 5–30% of parameters, and that subset is nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). If RL is largely routing pre-existing skills, then the human-specified abstractions (reward shape, control flow, skill vocabulary, the MDP boundary) are doing the real design work. That reframes the question's premise: you're not abstracting an RL problem so the algorithm can solve it; you're abstracting it because the abstraction *is* most of the solution, and the learning just tunes the dispatch.
Sources (9 notes)
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.