Why do LLMs fail at iterative numerical computation in latent space?
This explores why LLMs can recognize an optimization or numerical problem but can't actually run the step-by-step calculation in their heads (their internal 'latent space') — and what that failure tells us about how these models compute at all.
LLMs can look like they're computing when they're actually matching templates. The clearest statement of the failure is direct: when asked to run an iterative numerical method, LLMs don't iterate at all. They recognize the problem as similar to ones they've seen, then emit plausible-looking but wrong values, and this persists no matter how large the model gets or how it was trained (see "Do large language models actually perform iterative optimization?"). The same ceiling shows up from a different angle in constrained optimization, where models plateau around 55–60% constraint satisfaction regardless of parameter count or whether they're billed as 'reasoning' models: a wall, not a gap you can scale past (see "Do larger language models solve constrained optimization better?").
The deeper reason is what these models are built to do. Reframed at the computational level, an LLM is an autoregressive probability machine: it predicts likely next tokens. Tasks whose correct answers are low-probability under that distribution are systematically hard even when they're logically trivial (counting letters, reversing the alphabet), and you can predict the failures in advance from this framing (see "Can we predict where language models will fail?"). Iterative computation is exactly this kind of task: each step depends precisely on the last, and 'plausible-sounding' is not the same as 'arithmetically correct.'
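To make "each step depends precisely on the last" concrete, here is a minimal Python sketch of fixed-step gradient descent on a toy quadratic. The objective, step size, starting point, and function names are illustrative assumptions, not taken from any of the cited notes. It compares the true trace with one where a single intermediate value is swapped for a nearby "plausible" number.

```python
def gradient_descent_trace(grad, x0, lr, steps):
    """Return [x_0, ..., x_steps]; every iterate is computed from the one before it."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs


def grad_f(x):
    # Gradient of the assumed toy objective f(x) = (x - 3)^2.
    return 2.0 * (x - 3.0)


# The true five-step trace starting from x_0 = 0.
exact = gradient_descent_trace(grad_f, x0=0.0, lr=0.1, steps=5)

# Replace the true second iterate with a nearby "plausible-looking" value
# (a hypothetical stand-in for a guessed number) and continue from there.
plausible_x2 = 1.0
corrupted = exact[:2] + gradient_descent_trace(grad_f, x0=plausible_x2, lr=0.1, steps=3)

print("exact final iterate:    ", exact[-1])
print("corrupted final iterate:", corrupted[-1])
```

The two final iterates differ even though the substituted value was close to the true one. A value that merely looks right at one step yields a wrong answer at every step after it, which is why the procedure has to be executed and cannot be approximated by resemblance.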
What looks like reasoning turns out to be semantic association rather than symbolic execution. When researchers strip the familiar meaning out of a reasoning task and leave only the rules, LLM performance collapses, even with the correct rules sitting right there in context (see "Do large language models reason symbolically or semantically?"). A genuine iterative method is pure symbol manipulation with no semantic crutch, so the model falls back to pattern-matching the *shape* of the problem. The same surface-versus-structure gap appears in language itself: models nail statistical regularities but miss the underlying principles, and they degrade predictably as structural complexity climbs (see "Why do language models fail at communicative optimization?" and "Why do large language models fail at complex linguistic tasks?").
There's a tempting counterpoint worth knowing about: under hard, out-of-distribution tasks, LLM hidden states sparsify in a systematic way that actually *stabilizes* performance rather than breaking it (see "Do language models sparsify their activations under difficult tasks?"). So the internal machinery does adapt to difficulty; it just adapts toward robust pattern selection, not toward executing an algorithm. And the models can't reason their way out of this from the inside: self-improvement is formally bounded by a generation-verification gap, meaning a reliable numerical fix needs something external to check it (see "What stops large language models from improving themselves?"), echoing the broader proof that purely internal self-correction can't escape certain mathematical limits (see "Can any computable LLM truly avoid hallucinating?").
The less obvious lesson: the most promising fixes don't try to make the model iterate better; they route around its weakness. The MEDIC approach has the LLM solve a *simplified, deterministic* version of a hard stochastic problem, then hands the iterative grind to external machinery and validates the output with a separate critic (see "Can LLMs design reward functions for reinforcement learning?"). That's the practical shape of the answer: use the LLM for the pattern-recognition it's genuinely good at, and give the iterative computation to a tool that actually computes.
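The division of labor described above can be sketched in a few lines of Python. This is not MEDIC itself. It is a generic propose, compute, verify pattern under stated assumptions: the target equation cos(x) = x, the bracket [0, 1], the tolerances, and the names `bisect_root`, `critic_accepts`, and `solve_with_tool` are all hypothetical. The iteration runs in ordinary code, and a separate check validates any candidate value, whether it came from the solver or from a guess.

```python
import math


def bisect_root(f, lo, hi, tol=1e-12, max_iter=200):
    """External tool: bisection, which actually executes every iteration."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0.0:
        return lo
    if f_hi == 0.0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0.0 or (hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid  # root lies in the left half
        else:
            lo, f_lo = mid, f_mid  # root lies in the right half
    return 0.5 * (lo + hi)


def critic_accepts(f, x, tol=1e-9):
    """External check: accept x only if the residual |f(x)| is small."""
    return abs(f(x)) <= tol


def solve_with_tool(f, lo, hi):
    """Run the iterative solver, then let the critic validate its output."""
    x = bisect_root(f, lo, hi)
    if not critic_accepts(f, x):
        raise RuntimeError("critic rejected the solver output")
    return x


def f(x):
    # Assumed example problem: find x such that cos(x) = x.
    return math.cos(x) - x


root = solve_with_tool(f, 0.0, 1.0)
guessed = 0.74  # hypothetical stand-in for a plausible-looking emitted value

print("tool result:", root, "accepted:", critic_accepts(f, root))
print("guess:      ", guessed, "accepted:", critic_accepts(f, guessed))
```

In this sketch the guess is close to the true root yet fails the residual check, while the solver output passes. Keeping the critic separate from whatever produced the candidate is the design choice that matters: it is the external check the generation-verification argument calls for.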
Sources (10 notes)
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
LLMs successfully replicate statistical regularities learnable from text distributions (sound symbolism, priming) but fail at principles requiring pragmatic optimization (word length economy, discourse inference). The gap reveals that communicative logic—why language has certain forms—isn't present as a trainable signal.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.