INQUIRING LINE

How should humans specify deterministic abstractions of RL problems?

This explores where and how humans should hand-build the fixed structure around an RL problem — the reward definition, the control flow, the skill vocabulary, the exploration scaffolding — rather than leaving everything to be learned from scratch.


The corpus suggests the answer isn't one abstraction but several layers, each of which can be specified by hand, and getting each one right matters more than the learning algorithm itself.

Start with the reward, the most consequential abstraction a human writes. A naive deterministic spec ('correct = 1, wrong = 0') quietly teaches the wrong thing: binary correctness rewards incentivize confident guessing because they never punish confident errors. The fix is to add a second hand-specified term, the Brier score, so that accuracy and calibration are optimized together (Does binary reward training hurt model calibration?). The lesson generalizes: the human's job isn't to write *a* reward but to write one whose optimum is the behavior you actually want. When a clean deterministic verifier is hard to write, the corpus shows two escape hatches: semi-formal reasoning templates that verify patch equivalence at 93% accuracy without running anything (Can structured reasoning replace code execution for RL rewards?), and adversarial critics that replace a domain-specific verifier with a learned discriminator, letting you train reasoning even where no crisp success test exists (Can adversarial critics replace task-specific verifiers for reasoning?).
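The difference between the two reward specs can be made concrete with a small sketch. It assumes the combined reward takes the form "correctness minus squared error between stated confidence and outcome"; the function names and that exact form are illustrative assumptions, not necessarily the formulation used in the cited note.

```python
def binary_reward(correct: bool) -> float:
    """Naive spec: 1 for a right answer, 0 for a wrong one. Confidence is ignored."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Assumed form: correctness minus the Brier penalty (confidence - outcome)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected calibrated reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected calibrated reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))
```

Under `binary_reward`, a wrong answer stated with 90% confidence costs nothing extra. Under `calibrated_reward` it costs more than a wrong answer stated with 50% confidence, and for a fixed answer quality the expected reward peaks when stated confidence equals the true probability of being right.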

The second layer is control flow. LLM Programs make the case that humans should specify the *algorithm* (the explicit steps, state, and information hiding) and let the model fill only the step-specific slots, turning an opaque task into modular, debuggable sub-tasks (Can algorithms control LLM reasoning better than LLMs alone?). This is the purest form of 'deterministic abstraction': the human writes the scaffold, and the model works inside it. RLAD pushes the same idea into exploration itself: instead of spending compute on more or longer solution chains, it jointly trains an abstraction generator and a solution generator, and the diverse *abstractions* impose a structured breadth-first search that beats parallel solution sampling at large compute budgets (Can abstractions guide exploration better than depth alone?).
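A minimal sketch of the control-flow idea, under stated assumptions: `Model` is a hypothetical stand-in for any prompt-to-text function, and the two-step "extract facts, then answer" task is an invented example rather than one taken from the LLM Programs work. The algorithm owns the loop and the state, and each model call sees only the context for its own step.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
Model = Callable[[str], str]


@dataclass
class ProgramState:
    """Explicit state owned by the algorithm, not by the model."""
    question: str
    facts: List[str] = field(default_factory=list)


def run_program(question: str, documents: List[str], model: Model) -> str:
    state = ProgramState(question=question)

    # Step 1: one call per document. Each call sees only that document.
    for doc in documents:
        prompt = f"Question: {question}\nDocument: {doc}\nRelevant fact (or NONE):"
        fact = model(prompt).strip()
        if fact and fact.upper() != "NONE":
            state.facts.append(fact)

    # Step 2: the final call sees the extracted facts, never the raw documents.
    notes = "\n".join(f"- {f}" for f in state.facts)
    return model(f"Question: {question}\nFacts:\n{notes}\nAnswer:").strip()
```

Because every step is an ordinary function call with a visible prompt, a wrong answer can be traced to the step that produced it, which is the debuggability the paragraph above describes.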

A third, underrated layer is what gets abstracted *out* into memory or a skill library rather than baked into weights. VOYAGER stores executable skills in an embedding-indexed library and composes complex behavior from simpler pieces, which is itself a human-chosen abstraction boundary: skills are the named, reusable units (Can agents learn new skills without forgetting old ones?). AgentFly goes further, formalizing the whole problem as a memory-augmented MDP in which credit assignment and improvement happen through case, subtask, and tool memory operations while the network weights never change (Can agents learn continuously from experience without updating weights?). Choosing *which* MDP you're solving (what counts as state, what counts as an action) is the most fundamental abstraction of all, and these papers show it can live outside the parameters entirely.
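The skill-library boundary can be sketched in a few lines. Everything here is a hypothetical illustration: the class and method names are invented, the world state is a plain dict, and retrieval uses word overlap as a stand-in for the embedding index the note describes. It is not VOYAGER's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    name: str
    description: str
    run: Callable[[dict], dict]  # takes a world state, returns the new state


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def get(self, name: str) -> Skill:
        return self._skills[name]

    def compose(self, name: str, description: str, parts: List[str]) -> Skill:
        """Build a new skill by running existing skills in order, then store it."""
        steps = [self._skills[p] for p in parts]  # KeyError if a part is missing

        def run(state: dict) -> dict:
            for step in steps:
                state = step.run(state)
            return state

        skill = Skill(name, description, run)
        self.add(skill)
        return skill

    def search(self, query: str) -> List[str]:
        """Toy retrieval by shared words (assumption: stands in for embeddings)."""
        words = set(query.lower().split())
        scored = [(len(words & set(s.description.lower().split())), s.name)
                  for s in self._skills.values()]
        return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

New behavior is added by calling `add` or `compose`, so the agent's repertoire grows without touching any model parameter, and old skills cannot be overwritten by learning new ones.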

The quiet thread across all of this: RL mostly doesn't create new capability, it deploys latent capability. Base models already contain reasoning strategies, and RL learns *when* to fire them, not *how* (Does RL post-training create reasoning or just deploy it?). The updates also touch only a sparse but structured 5–30% of parameters, and that subset is nearly identical across random seeds (Does reinforcement learning update only a small fraction of parameters?). If RL is largely routing pre-existing skills, then the human-specified abstractions (reward shape, control flow, skill vocabulary, the MDP boundary) are doing the real design work. That reframes the question's premise: you're not abstracting an RL problem so the algorithm can solve it; you're abstracting it because the abstraction *is* most of the solution, and the learning just tunes the dispatch.
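The two sparsity claims are easy to state as measurements. The sketch below assumes weights are available as flattened lists keyed by layer name, a hypothetical layout chosen to keep the example dependency-free, and it counts a parameter as updated when it moved by more than a tolerance. The cited work may define "updated" differently.

```python
from typing import Dict, List

# Hypothetical layout: layer name -> flattened parameter values.
Weights = Dict[str, List[float]]


def changed_mask(before: Weights, after: Weights, tol: float = 0.0) -> List[bool]:
    """One flag per parameter: True if it moved by more than tol."""
    mask: List[bool] = []
    for name in sorted(before):
        old, new = before[name], after[name]
        if len(old) != len(new):
            raise ValueError(f"shape mismatch in layer {name!r}")
        mask.extend(abs(a - b) > tol for a, b in zip(old, new))
    return mask


def update_fraction(before: Weights, after: Weights, tol: float = 0.0) -> float:
    """Share of parameters that changed during training."""
    mask = changed_mask(before, after, tol)
    return sum(mask) / len(mask) if mask else 0.0


def mask_overlap(mask_a: List[bool], mask_b: List[bool]) -> float:
    """Jaccard overlap between two update masks, e.g. from two random seeds."""
    both = sum(a and b for a, b in zip(mask_a, mask_b))
    either = sum(a or b for a, b in zip(mask_a, mask_b))
    return both / either if either else 1.0
```

An `update_fraction` in the 0.05 to 0.30 range, together with a `mask_overlap` near 1.0 between runs that differ only in seed, would correspond to the pattern described above.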


Sources (9 notes)

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL systems analyst. The question remains open: **Where should humans impose deterministic structure on RL problems—and how has capability progress since mid-2024 changed what *must* be hand-specified versus what can be learned?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat each as a constraint snapshot:

- Reward shape is the most consequential human abstraction; naive binary rewards degrade calibration, but adding proper scoring rules (Brier score) co-optimizes accuracy and confidence (~2024).
- Deterministic verifiers are hard to write; workarounds include semi-formal reasoning templates (93% code-equivalence accuracy without execution) and learned adversarial critics that replace domain-specific reward (~2024–2025).
- Control flow: humans should specify the *algorithm* (steps, state, information hiding) and let the model fill step-specific slots; this turns opaque tasks into modular, debuggable sub-tasks (~2024–2025).
- RL updates touch only 5–30% of parameters, in sparse-but-structured subnetworks; RL deploys latent capability already present in base models rather than creating it; learning tunes *when* to fire skills, not *how* (~2025).
- Memory and skill libraries (VOYAGER, AgentFly) externalize abstraction boundaries—credit assignment and improvement happen through memory operations, weights unchanged (~2025–2026).

Anchor papers (verify; mind their dates):
- 2024-09: arXiv:2409.15360 (Reward-Robust RLHF)
- 2025-05: arXiv:2505.11711 (Sparse RL parameter updates)
- 2025-08: arXiv:2508.20722 (rStar2-Agent reasoning)
- 2026-04: arXiv:2604.08377 (SkillClaw collective skill evolution)

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models, methods (reasoning scales, synthetic data, multi-agent ensembles), tooling (verifier libraries, LLM-as-verifier APIs), or evaluation have since RELAXED or OVERTURNED it. Separate durable questions (e.g., does reward shaping still matter?) from perishable limits (e.g., can adversarial critics now match hand-written verifiers?). Cite what resolved each constraint, or state plainly where it holds.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Does any recent paper show that end-to-end learning *without* hand-specified abstractions now matches or beats abstraction-first approaches? Cite it.

(3) **Propose 2 research questions that ASSUME the regime may have moved.** For example: if sparse RL updates mean most capability is pre-trained, how should humans design *meta-abstractions* that guide which 5–30% gets tuned? Or: do skill libraries (VOYAGER-like) now scale better than reward shaping for long-horizon tasks?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
