INQUIRING LINE

Why do LLMs fail at iterative numerical computation in latent space?

This explores why LLMs can recognize an optimization or numerical problem but can't actually run the step-by-step calculation in their heads (their internal 'latent space') — and what that failure tells us about how these models compute at all.


LLMs can look as if they are computing when they are in fact matching templates. The clearest statement of the failure is direct: when asked to run an iterative numerical method, LLMs don't iterate at all. They recognize the problem as similar to ones they've seen, then emit plausible-looking but wrong values, and this persists across model scale and training approaches (see "Do large language models actually perform iterative optimization?"). The same ceiling shows up from a different angle in constrained optimization, where models plateau around 55–60% constraint satisfaction regardless of architecture, parameter count, or whether they're billed as 'reasoning' models: a ceiling, not a gap you can scale past (see "Do larger language models solve constrained optimization better?").
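To make concrete what "iterating" demands, here is a minimal sketch of one iterative numerical method, Newton's method for a square root. It is not taken from the cited notes; the function name `newton_sqrt`, the starting guess, and the tolerance are illustrative assumptions. The point is only that every value is computed from the exact previous value, so a plausible-looking guess at any step corrupts everything after it.

```python
def newton_sqrt(a, x0=1.0, tol=1e-12, max_iter=50):
    """Approximate sqrt(a) with Newton's method on f(x) = x*x - a.

    Hypothetical helper for illustration. Returns (estimate, history),
    where history lists every iterate, starting with x0.
    """
    if a <= 0:
        raise ValueError("a must be positive")
    x = x0
    history = [x]
    for _ in range(max_iter):
        # Newton update: x_next = x - f(x) / f'(x) = (x + a / x) / 2.
        # It uses the exact previous iterate, not an approximation of it.
        x_next = 0.5 * (x + a / x)
        history.append(x_next)
        # Stop once successive iterates agree to the requested tolerance.
        if abs(x_next - x) <= tol * max(1.0, abs(x_next)):
            return x_next, history
        x = x_next
    return x, history
```

Calling `newton_sqrt(2.0)` returns an estimate of the square root of 2 together with the list of iterates; the first update from the default guess of 1.0 is exactly 1.5, and each later entry is derived from the one before it.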

The deeper reason is what these models are built to do. Viewed at the computational level, an LLM is an autoregressive probability machine: it predicts likely next tokens. Tasks whose correct answers are low-probability under that distribution are systematically hard even when they're logically trivial (counting letters, reciting the alphabet backwards), and this framing lets you predict the failures in advance (see "Can we predict where language models will fail?"). Iterative computation is exactly this kind of task: each step depends precisely on the last, and 'plausible-sounding' is not the same as 'arithmetically correct.'
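The "autoregressive probability machine" framing can be written down in the standard way. The notation below is an assumption for illustration and is not drawn from the cited note: $x_1,\dots,x_T$ are tokens, $\theta$ the model parameters, $c$ a prompt, and $y^\ast$ the correct answer.

```latex
% Standard autoregressive factorization of a token sequence.
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_{<t}\right)

% The framing predicts difficulty when the correct answer y* is unlikely
% under the model, i.e. when the quantity below is small, regardless of
% how simple the task is logically.
p_\theta\!\left(y^\ast \mid c\right) \;=\; \prod_{t=1}^{|y^\ast|} p_\theta\!\left(y^\ast_t \mid c,\, y^\ast_{<t}\right)
```

Read this way, an intermediate value in an iterative method is just another token string: nothing in the objective rewards it for being arithmetically right rather than likely.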

What looks like reasoning turns out to be semantic association rather than symbolic execution. When researchers strip the familiar meaning out of a reasoning task and leave only the rules, LLM performance collapses, even with the correct rules sitting right there in context (see "Do large language models reason symbolically or semantically?"). A genuine iterative method is pure symbol manipulation with no semantic crutch, so the model falls back to pattern-matching the *shape* of the problem. The same surface-versus-structure gap appears in language itself: models reproduce statistical regularities but miss the underlying principles, and they degrade predictably as structural complexity climbs (see "Why do language models fail at communicative optimization?" and "Why do large language models fail at complex linguistic tasks?").

There's a tempting counterpoint worth knowing about: on hard, out-of-distribution tasks, LLM hidden states sparsify in a systematic way that actually *stabilizes* performance rather than breaking it (see "Do language models sparsify their activations under difficult tasks?"). So the internal machinery does adapt to difficulty; it just adapts toward robust pattern selection, not toward executing an algorithm. And the models can't reason their way out of this from the inside: self-improvement is formally bounded by a generation-verification gap, meaning a reliable numerical fix needs something external to check it (see "What stops large language models from improving themselves?"). This echoes the broader proof that purely internal self-correction can't escape certain mathematical limits (see "Can any computable LLM truly avoid hallucinating?").

A less obvious lesson: the most promising fixes don't try to make the model iterate better; they route around its weakness. The MEDIC approach has the LLM solve a *simplified, deterministic* version of a hard stochastic reinforcement-learning problem, converts the resulting plan into shaping rewards for the original task so that the iterative work stays with external machinery, and validates the LLM's output with a separate model-based critic (see "Can LLMs design reward functions for reinforcement learning?"). That's the practical shape of the answer: use the LLM for the pattern recognition it's genuinely good at, and give the iterative computation to a tool that actually computes.
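The division of labor described above can be sketched in a few lines. This is not MEDIC and not any cited paper's code; it is a hedged analogy for root finding, and every name (`bisect_root`, `critic_accepts`, `solve_with_delegation`, `proposed_bracket`) is hypothetical. The bracket stands in for the model's pattern-recognition output, a plain bisection routine does the iteration, and an independent residual check plays the critic.

```python
def bisect_root(f, lo, hi, tol=1e-10, max_iter=200):
    """External solver: bisection on an interval where f changes sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0 or (hi - lo) / 2 < tol:
            return mid
        # Keep the half-interval that still contains the sign change.
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def critic_accepts(f, x, residual_tol=1e-8):
    """Independent check: accept x only if the residual |f(x)| is small."""
    return abs(f(x)) <= residual_tol


def solve_with_delegation(f, proposed_bracket):
    """Run the external solver on a proposed bracket, then verify.

    proposed_bracket is a stand-in (assumption) for what a model might
    supply after recognizing the problem; no model is called here.
    """
    lo, hi = proposed_bracket
    x = bisect_root(f, lo, hi)
    if not critic_accepts(f, x):
        raise RuntimeError("critic rejected the solver output")
    return x
```

For example, with `f = lambda x: x**3 - 2*x - 5` and the bracket `(2.0, 3.0)`, `solve_with_delegation` returns a root near 2.0946, while `critic_accepts(f, 2.1)` is false: a plausible-looking value that was never actually computed fails the external check.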


Sources (10 notes)

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Why do language models fail at communicative optimization?

LLMs successfully replicate statistical regularities learnable from text distributions (sound symbolism, priming) but fail at principles requiring pragmatic optimization (word length economy, discourse inference). The gap reveals that communicative logic—why language has certain forms—isn't present as a trainable signal.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can any computable LLM truly avoid hallucinating?

Three formal theorems prove that any computable LLM must hallucinate on infinitely many inputs, and internal mechanisms like self-correction cannot eliminate this mathematical constraint. External safeguards are therefore necessary, not optional.

Can LLMs design reward functions for reinforcement learning?

MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing constraints on LLM numerical computation. The question: *Can LLMs execute iterative algorithms in latent space, or are they fundamentally confined to pattern matching?* Treat this as still-open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable.
- LLMs plateau at ~55–60% constraint satisfaction regardless of scale or 'reasoning' training (2026-03).
- Autoregressive probability predicts failures: tasks with low-probability correct answers fail systematically, including iteration (2023–2024).
- Models are in-context semantic reasoners, not symbolic; strip semantics and performance collapses even with rules in context (2023-05).
- Self-improvement is formally bounded by a generation-verification gap; purely internal correction cannot escape mathematical limits (2024-01, 2024-12).
- Under OOD shift, LLM hidden states sparsify and *stabilize* performance, but toward robust pattern selection, not algorithm execution (2026-03).
- MEDIC approach: delegate iteration to external tools; LLM solves simplified deterministic proxy, critic validates (implicit in synthesis).

Anchor papers (verify; mind their dates):
- arXiv:2305.14825 (2023-05): In-Context Semantic Reasoners
- arXiv:2401.11817 (2024-01): Hallucination is Inevitable
- arXiv:2412.02674 (2024-12): Self-Improvement Gap
- arXiv:2603.23004 (2026-03): Reasoning Under Constraints

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above—especially the 55–60% ceiling and the semantic/symbolic divide—judge whether newer models, in-context learning techniques (chain-of-thought variants, scratchpad), tool use (calculators, symbolic engines), or multi-step orchestration (critique loops, verifier ensembles) have since relaxed or overturned it. Separate the durable question (can autoregressive tokens ever *execute* iteration?) from the perishable limitation (can routing + external compute solve it?). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING work.** Find papers from the last ~6 months arguing LLMs *can* reliably perform iterative computation under some regime (e.g., process reward models, tree search with guaranteed termination, or learned iterative solvers).
(3) **Propose 2 research questions assuming the regime has shifted:** e.g., *If LLMs can't iterate internally, can they learn to construct verifiable invariants that external solvers check?* Or *Under what task structure does delegation-to-tool become empirically lossless?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
