INQUIRING LINE

Why do LLMs fail at directly solving stochastic control problems?

This explores why LLMs stumble when asked to *directly* compute solutions to control problems involving randomness and uncertainty — and what the corpus suggests works instead.


This question reads as: why can't an LLM just take a stochastic control problem — one where outcomes are uncertain and you have to optimize a policy over many possible futures — and produce the right answer end-to-end? The corpus points to a single underlying reason with two faces: LLMs are autoregressive pattern-matchers, not iterative solvers, and stochastic control is precisely a problem that demands iteration over uncertainty.

The sharpest evidence is that LLMs don't actually *run* numerical procedures. They recognize a problem as template-similar to ones they've seen and emit plausible-looking values without executing the underlying computation (Do large language models actually perform iterative optimization?). Stochastic control is built on exactly the kind of iterative machinery (value iteration, expectation over distributions, policy refinement) that this failure mode breaks. You can see the symptom downstream: on genuine constrained-optimization tasks, models plateau around 55–60% constraint satisfaction regardless of size or whether they're 'reasoning' models, which points to a ceiling rather than a scaling gap (Do larger language models solve constrained optimization better?). More scale doesn't buy you the missing capability because the capability isn't a matter of degree.
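To make "iterative machinery" concrete, here is a minimal sketch of value iteration on a toy problem. The two-state MDP, its action names, its rewards, and the discount factor are all invented for illustration; none of them come from the sources. The point is only that the optimal values emerge from repeated sweeps, each one taking an expectation over outcomes, rather than from a single pass.

```python
# Hypothetical 2-state, 2-action MDP, invented purely for illustration.
# P[state][action] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {"safe": [(1.0, 0, 1.0)],
        "risky": [(0.5, 1, 0.0), (0.5, 0, 3.0)]},
    1: {"safe": [(1.0, 0, 0.0)],
        "risky": [(1.0, 1, -1.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, s, a):
    """Expected return of action a in state s, given value estimates V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-8, max_sweeps=10_000):
    """Repeat Bellman backups until the values stop changing."""
    V = {s: 0.0 for s in P}
    for sweep in range(1, max_sweeps + 1):
        V_new = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        delta = max(abs(V_new[s] - V[s]) for s in P)
        V = V_new
        if delta < tol:
            break
    # Read the greedy policy off the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
    return V, policy, sweep


if __name__ == "__main__":
    V, policy, sweeps = value_iteration()
    print(policy, V, sweeps)
```

Even on a problem this small, the loop needs many sweeps before the values settle, which is the kind of repeated computation the sources say an LLM does not carry out when it emits an answer directly.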

There's a deeper, almost information-theoretic version of why. If you frame an LLM as a machine that maximizes the probability of the next token, then tasks whose *correct* answer is a low-probability string become systematically hard even when they're logically trivial, and this is predictable in advance (Can we predict where language models will fail?). The optimal action in a stochastic problem is often not the 'fluent' or typical-looking one; it's whatever the math says, which may sit far from the model's learned distribution over plausible continuations. The architecture's strength (fluency) is the same thing working against it.

What's interesting is what the corpus says *does* work, and it's a consistent move: don't ask the LLM to solve the stochastic problem; ask it to do the part it's good at and offload the solving. MEDIC has the LLM solve a *deterministic, simplified* version of the problem first, then converts that plan into reward-shaping signals for the real stochastic task, with a model-based critic checking the output before it's trusted (Can LLMs design reward functions for reinforcement learning?). The same philosophy shows up in LLM Programs, which embed the model inside an explicit algorithm that manages state and control flow, handing the LLM only the narrow, step-specific judgment it's reliable for (Can algorithms control LLM reasoning better than LLMs alone?).
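The plan-then-shape idea can be sketched in a few lines. This is not MEDIC's actual implementation: the grid, the breadth-first search standing in for the LLM's deterministic plan, the validity check standing in for the critic, and the use of potential-based shaping (the standard form `gamma * phi(s') - phi(s)`) are all assumptions chosen to keep the example self-contained.

```python
from collections import deque

GRID_W, GRID_H = 4, 3          # hypothetical grid size
START, GOAL = (0, 0), (3, 2)   # hypothetical start and goal cells
GAMMA = 0.95                   # assumed discount factor


def neighbors(cell):
    """Cells reachable in one deterministic move."""
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < GRID_W and 0 <= ny < GRID_H:
            yield (nx, ny)


def propose_plan(start, goal):
    """Stand-in for the LLM: plan on the deterministic version via BFS."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        for nxt in neighbors(cell):
            if nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    path, cell = [], goal
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    return path[::-1]


def plan_is_valid(plan, start, goal):
    """Stand-in for the critic: reject plans with bad endpoints or illegal moves."""
    if not plan or plan[0] != start or plan[-1] != goal:
        return False
    return all(b in neighbors(a) for a, b in zip(plan, plan[1:]))


def make_shaping(plan):
    """Turn a validated plan into a potential-based shaping reward."""
    progress = {cell: i for i, cell in enumerate(plan)}

    def potential(cell):
        return float(progress.get(cell, 0))  # off-plan cells get zero potential

    def shaping(s, s_next):
        return GAMMA * potential(s_next) - potential(s)

    return shaping


if __name__ == "__main__":
    plan = propose_plan(START, GOAL)
    if plan_is_valid(plan, START, GOAL):
        shaping = make_shaping(plan)
        print(plan, shaping(plan[0], plan[1]))
```

The shaping term would be added to the environment's own reward while an ordinary RL algorithm handles the stochastic dynamics; the planner only supplies a hint about direction, and the check runs before that hint is used.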

The thing you might not have expected: the answer to 'LLMs can't do control' is not 'so don't use them for control.' Reinforcement learning *does* successfully scale LLMs to long-horizon, stateful, delayed-reward tasks, roughly doubling SWE-bench Verified performance (from 20% to 39%) in a multi-step environment (Can reinforcement learning scale beyond single-turn language tasks?). The lesson across the collection is that the LLM works as a *component* whose behavior is shaped by an external optimization loop (or a deterministic scaffold, or a critic), rather than as the solver that internalizes the stochastic dynamics itself. The failure is about asking one tool to be the whole pipeline.
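A minimal sketch of the "component, not solver" pattern follows, using a toy inventory problem that is not drawn from the sources. The function `llm_propose_candidates` is a hypothetical stand-in for a model call and simply returns fixed values; the price, cost, and demand distribution are likewise invented. The scaffold, not the model, takes the expectation over uncertain demand and picks the winner.

```python
import random

PRICE, COST = 5.0, 3.0   # hypothetical unit price and unit cost


def llm_propose_candidates():
    """Stand-in for an LLM call: returns a few plausible order quantities."""
    return [20, 40, 60, 80]


def expected_profit(quantity, demands):
    """Monte Carlo estimate of profit for one order quantity."""
    total = sum(PRICE * min(d, quantity) - COST * quantity for d in demands)
    return total / len(demands)


def choose_order(n_samples=5000, seed=0):
    """Scaffold: sample demand, score every candidate, keep the best."""
    rng = random.Random(seed)
    # Assumed demand model: uniform integers from 0 to 100 inclusive.
    demands = [rng.randint(0, 100) for _ in range(n_samples)]
    # Score all candidates on the same samples so comparisons are fair.
    scores = {q: expected_profit(q, demands) for q in llm_propose_candidates()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = choose_order()
    print(best, scores)
```

The design choice here is the division of labor: the proposal step can be as fluent and approximate as it likes, because the numerical comparison happens in ordinary code that actually executes.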


Sources (6 notes)

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether LLMs' known failures on stochastic control problems have been materially relaxed by newer models, training methods, or orchestration patterns since mid-2023.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and center on a core constraint:
• LLMs are autoregressive pattern-matchers that cannot execute iterative numerical procedures (value iteration, expectation over distributions) in latent space; they emit plausible-looking outputs instead of running the computation (~2024).
• On constrained optimization, models plateau at 55–60% constraint satisfaction regardless of scale or reasoning capability — a hard ceiling, not a scaling gap (~2026).
• Low-probability correct answers are systematically harder to reach because the next-token objective favors fluent, typical continuations, not mathematically optimal ones that sit in the distribution's tail (~2024).
• Successful deployments offload stochastic solving: MEDIC solves a deterministic proxy first, then shapes rewards; RL loops around the LLM as a component, not the solver (~2024–2025).
• RL finetuning on long-horizon tasks roughly doubled SWE-bench Verified performance (20% to 39% on Qwen2.5-72B), showing LLMs *can* internalize delayed reward when the optimization loop is external (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.17385 (Feb 2024): Determinants of LLM-assisted Decision-Making
• arXiv:2508.03501 (Aug 2025): Training Long-Context, Multi-Turn Software Engineering Agents with RL
• arXiv:2603.23004 (Mar 2026): Can Large Language Models Reason and Optimize Under Constraints?
• arXiv:2605.12978 (May 2026): Useful Memories Become Faulty When Continuously Updated by LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 55–60% plateau, the inability to run iterative procedures in latent space, and the tail-probability problem: has post-2025 scaling (especially in reasoning models like o1 variants or process-reward finetuning) *dissolved* these, or do they still hold even at frontier scale? Separate durable from perishable. For each constraint that no longer holds, cite what resolved it.
(2) Surface the strongest CONTRADICTING work from the last ~6 months—any paper showing an LLM *directly* solving stochastic control end-to-end, or upending the "component not solver" finding.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Can in-context learned value functions (via few-shot MDP examples) let LLMs replace iterative solvers for toy domains?" or "Does process-reward supervision on trajectory sampling overcome tail-probability bias?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
