What makes some tasks bounded enough for reliable RL?
This explores what properties make a task 'bounded' (verifiable, decomposable, closed-ended) enough that reinforcement learning produces reliable gains rather than noise or degradation.
The corpus keeps returning to one answer: what bounds a task is not its size but whether you can cheaply tell right from wrong at each step. The cleanest illustration is verifiability. Execution-free code reasoning becomes a usable RL signal only once the verifier is accurate enough: semi-formal reasoning templates reach about 93% accuracy at checking whether two patches are equivalent, which the source note treats as crossing the reliability threshold for reward signals. Below that threshold the reward is too noisy to train on; above it, certain task classes (fault localization, patch equivalence) become RL-tractable (see "Can structured reasoning replace code execution for RL rewards?"). Boundedness, in other words, is a property of the *verifier*, not just the problem.
The second ingredient is decomposability. A task that looks impossibly long-horizon becomes bounded if you can break it into minimal subtasks, each small enough to check by voting. MAKER completes million-step tasks with zero errors this way and, strikingly, finds that small non-reasoning models suffice once decomposition is extreme enough (see "Can extreme task decomposition enable reliable execution at million-step scale?"). The same instinct shows up in reasoning structured as recursive subtask trees, where bounding each step's working memory lets a single model sustain accuracy past its context limits (see "Can recursive subtask trees overcome context window limits?"). Reliability isn't found in the whole task; it's manufactured by carving the task into pieces whose correctness is locally decidable.
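To make the per-step voting idea concrete, here is a minimal Python sketch. It is not MAKER's implementation: the lead-margin stopping rule, the names `vote_step` and `run_chain`, and the `margin` and `max_samples` parameters are all assumptions chosen for illustration. Each subtask is resolved by sampling answers until one is clearly ahead, so a per-sample error only matters if it wins the vote.

```python
from collections import Counter
from typing import Callable, Hashable


def vote_step(sample: Callable[[], Hashable], margin: int = 3,
              max_samples: int = 50) -> Hashable:
    """Resolve one minimal subtask by repeated sampling and voting.

    Stops as soon as the leading answer is `margin` votes ahead of the
    runner-up (an assumed stopping rule), or falls back to the plurality
    answer after `max_samples` draws.
    """
    counts: Counter = Counter()
    for _ in range(max_samples):
        counts[sample()] += 1
        ranked = counts.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if ranked[0][1] - runner_up >= margin:
            return ranked[0][0]
    return counts.most_common(1)[0][0]


def run_chain(step_samplers: list[Callable[[], Hashable]],
              margin: int = 3) -> list[Hashable]:
    """Run a long task as a sequence of independently voted steps."""
    return [vote_step(s, margin=margin) for s in step_samplers]
```

In practice each entry of `step_samplers` would wrap a call to a small model on one subtask; the sketch leaves that call abstract because the source does not specify it.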
The domain itself also matters. Structured domains (math and code, with closed answers) and creative ones (open-ended generation) pull entropy in opposite directions: structured training systematically *decreases* output entropy toward a correct attractor, while creative training increases it. Ordering therefore matters: in Omni-Thinker, BWT-guided scheduling that trains structured tasks first yields 6.2% gains over joint training, because it keeps entropy collapse from damaging open-ended skills (see "Does training order reshape how models handle different task types?"). So 'bounded' partly means 'has a low-entropy correct answer the model can converge onto', which is exactly why RL behaves differently on essays than on equations.
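One rough way to observe the entropy difference described above is to sample several completions for the same prompt and measure the Shannon entropy of the resulting answers. The sketch below is only an illustration: `empirical_entropy` is a hypothetical helper, and treating each distinct output string as one outcome is a simplifying assumption (the source does not say how entropy was measured).

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def empirical_entropy(samples: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of outputs."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative inputs only, not measurements from any model.
math_answers = ["42"] * 8                  # every sample agrees
story_openings = ["a", "b", "c", "d"]      # every sample differs
low = empirical_entropy(math_answers)
high = empirical_entropy(story_openings)
```

A task with a single correct attractor drives this quantity toward zero as training proceeds, while an open-ended task keeps it high.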
But boundedness has a ceiling worth knowing about. Even on perfectly verifiable tasks, RLVR doesn't seem to expand what a model can do: pass@k analysis shows base models overtaking RLVR models at high k, which suggests RLVR sharpens sampling toward solutions already latent in the base model rather than teaching genuinely new reasoning (see "Does RLVR actually expand what models can reason about?"). And the reward shape can quietly betray you: binary correct/incorrect rewards encourage confident wrong answers because nothing penalizes confident errors, until you add a proper scoring rule such as the Brier score as a second reward term (see "Does binary reward training hurt model calibration?"). A task can be bounded and still teach the wrong lesson if the reward is mis-specified.
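Both points in this paragraph come down to small formulas, sketched here in Python under stated assumptions. `calibrated_reward` assumes the combined reward is correctness minus the squared gap between stated confidence and correctness; the source says only that a Brier term is added, so that exact form is an assumption. `pass_at_k` is the standard unbiased estimator used in code-generation evaluation, which may differ from what the source note computed.

```python
from math import comb


def binary_reward(correct: bool) -> float:
    """Plain correctness reward: confidence never enters."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence in [0, 1]."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under `binary_reward`, a wrong answer scores the same whether the model claimed certainty or doubt. Under `calibrated_reward`, a wrong answer stated with full confidence scores lower than the same wrong answer stated with confidence 0.5, which is the pressure toward calibration that the paragraph describes.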
The encouraging counterweight is that boundedness can be *engineered* into territory that looks unbounded. Modified DAPO training roughly doubled SWE-bench Verified performance (from 20% to 39% on Qwen2.5-72B) on genuinely stateful, multi-turn software tasks with delayed rewards (see "Can reinforcement learning scale beyond single-turn language tasks?"). GRPO-RoC got a 14B model to frontier math performance by filtering noisy positive trajectories while keeping diverse failures as negative signal, essentially cleaning the reward channel rather than the task (see "Why do correct code trajectories teach models to tolerate errors?"). The thing you didn't know you wanted to know: 'bounded enough for RL' is rarely a fixed fact about a task. It's a verifier you can build, a decomposition you can impose, and a reward you can shape. Where you can't build any of those, RL stalls no matter how simple the problem looks.
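The idea of cleaning the reward channel can be sketched as a filter applied to an oversampled group of rollouts before advantages are computed. This is a hedged illustration, not the published GRPO-RoC procedure: the `Rollout` fields, the use of `tool_errors` as the quality measure, and the proportional split between positives and negatives are all assumptions. Only the group-normalized advantage (reward minus group mean, divided by group standard deviation) follows the usual GRPO convention.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class Rollout:
    reward: float      # 1.0 if the final answer verified as correct, else 0.0
    tool_errors: int   # hypothetical noise measure, e.g. failed tool calls


def resample_group(rollouts: list[Rollout], group_size: int,
                   rng: random.Random) -> list[Rollout]:
    """Downsample a non-empty oversampled pool to `group_size` rollouts.

    Positives are kept cleanest-first; negatives are drawn uniformly so
    that diverse failures survive as negative signal.
    """
    positives = sorted((r for r in rollouts if r.reward > 0),
                       key=lambda r: r.tool_errors)
    negatives = [r for r in rollouts if r.reward <= 0]
    n_pos = round(group_size * len(positives) / len(rollouts))
    n_neg = min(group_size - n_pos, len(negatives))
    return positives[:n_pos] + rng.sample(negatives, n_neg)


def group_advantages(group: list[Rollout], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (reward - mean) / (std + eps)."""
    rewards = [r.reward for r in group]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(x - mean) / (std + eps) for x in rewards]
```

The design choice worth noticing is the asymmetry: correct trajectories that reached the answer through error-ridden paths are dropped so they are not reinforced, while failures are left unfiltered.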
Sources (8 notes)
Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.
MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.