INQUIRING LINE


Can an AI actually run step-by-step calculations inside its own thinking — or does it just fake the math?

Can language models execute iterative numerical methods in latent space?

This explores whether LLMs can genuinely run step-by-step numerical procedures inside their hidden 'thinking' layers — or whether they only look like they're computing while actually doing something else.

This explores whether LLMs can genuinely run step-by-step numerical procedures inside their hidden activations (latent space), rather than just producing answers that resemble the output of such procedures. The corpus has a sharp, direct answer: no, and the reason is more interesting than the verdict. Research finds that when you hand an LLM an optimization problem, it doesn't iterate toward a solution the way Newton's method or gradient descent would. Instead it recognizes the problem as template-similar to things it has seen and emits plausible-looking but wrong values, a failure that doesn't go away as models get bigger or training improves (Do large language models actually perform iterative optimization?).
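For readers who want a concrete picture of what "iterating toward a solution" means, here is a minimal sketch of Newton's method on a one-dimensional problem. The objective f(x) = exp(x) - 2x, the starting point, and the function name `newton_minimize` are illustrative assumptions, not taken from the studies cited here. The point is only that each step depends on the state left by the previous one.

```python
import math

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a smooth 1-D function with Newton's method.

    grad and hess are the first and second derivatives of the objective.
    Returns the final estimate and the list of all iterates.
    """
    x = x0
    history = [x]
    for _ in range(max_iter):
        step = grad(x) / hess(x)   # Newton step: f'(x) / f''(x)
        x = x - step               # revise the state using the previous state
        history.append(x)
        if abs(step) < tol:        # stop once the update is negligible
            break
    return x, history

# Hypothetical objective (an assumption for illustration):
# f(x) = exp(x) - 2x, whose minimum is at x = ln 2.
x_star, history = newton_minimize(
    grad=lambda x: math.exp(x) - 2.0,
    hess=lambda x: math.exp(x),
    x0=0.0,
)
print(x_star, len(history) - 1)  # final estimate and number of steps taken
```

Every entry in `history` is computed from the one before it. That dependence on intermediate state is what the research says the model does not reproduce internally when it answers from template similarity.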

That single finding sits inside a larger pattern the collection keeps surfacing: LLMs pattern-match where we expect them to compute. On genuine constrained-optimization tasks, models plateau at roughly 55–60% constraint satisfaction regardless of architecture, parameter count, or whether they're billed as 'reasoning' models. That is a ceiling, not a gap you can scale your way out of (Do larger language models solve constrained optimization better?). And you can predict this in advance: if you treat an LLM as an autoregressive probability machine rather than a calculator, the tasks it fails are exactly the low-probability ones, such as counting letters or reversing the alphabet, that are logically trivial but statistically rare (Can we predict where language models will fail?). Iterative numerical work is the same kind of trap: easy to state, but it requires actual procedure-following rather than recall.
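To make "constraint satisfaction" concrete, the sketch below scores a candidate answer as the fraction of constraints it meets. This definition, the toy constraints, and the name `satisfaction_rate` are assumptions for illustration. The notes here do not spell out how the cited study computes its 55–60% figure.

```python
def satisfaction_rate(candidate, constraints):
    """Return the fraction of constraints that the candidate satisfies."""
    satisfied = sum(1 for _, check in constraints if check(candidate))
    return satisfied / len(constraints)

# Hypothetical toy problem: four constraints over two variables.
constraints = [
    ("x + y <= 10", lambda v: v["x"] + v["y"] <= 10),
    ("x >= 0",      lambda v: v["x"] >= 0),
    ("y >= 0",      lambda v: v["y"] >= 0),
    ("x - y >= 2",  lambda v: v["x"] - v["y"] >= 2),
]

# A plausible-looking answer that quietly violates one constraint.
candidate = {"x": 4, "y": 5}
rate = satisfaction_rate(candidate, constraints)
print(rate)  # 0.75: three of four constraints hold, x - y >= 2 fails
```

An answer like this looks reasonable at a glance and passes most checks, which is roughly the kind of output the plateau describes: close in form, wrong on the constraints that need actual computation.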

What makes this more than a 'models are bad at math' story is that the same limitation shows up far from arithmetic. Models misparse deeply nested grammatical clauses, degrading predictably as structural depth grows, a sign that they capture surface statistics rather than deep grammatical rules (Why do large language models fail at complex linguistic tasks?). And long-context models can match retrieval systems on semantic lookup yet collapse on relational queries that need joins across structured tables, another case where genuine multi-step manipulation, not recognition, is required (Can long-context LLMs replace retrieval-augmented generation systems?). The common thread: wherever a task demands executing a procedure rather than retrieving a pattern, the latent-space machinery falls back to matching.
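As a rough illustration of why a relational query is a procedure rather than a lookup, the sketch below answers "what is the order total per city?" over two small tables. The tables, field names, and the function `total_by_city` are hypothetical and are not drawn from the LOFT benchmark.

```python
# Two hypothetical tables linked by customer id.
customers = [
    {"id": 1, "name": "Ada", "city": "Oslo"},
    {"id": 2, "name": "Ben", "city": "Lima"},
    {"id": 3, "name": "Cy",  "city": "Oslo"},
]
orders = [
    {"customer_id": 1, "total": 40},
    {"customer_id": 2, "total": 15},
    {"customer_id": 1, "total": 25},
    {"customer_id": 3, "total": 5},
]

def total_by_city(customers, orders):
    """Join orders to customers on id, then sum order totals per city."""
    # Step 1: build an index from customer id to city (the join key).
    city_of = {c["id"]: c["city"] for c in customers}
    # Step 2: walk every order, resolve its city, and accumulate.
    totals = {}
    for order in orders:
        city = city_of[order["customer_id"]]
        totals[city] = totals.get(city, 0) + order["total"]
    return totals

result = total_by_city(customers, orders)
print(result)  # {'Oslo': 70, 'Lima': 15}
```

No single row contains the answer. It only appears after matching keys across tables and carrying a running total, which is the multi-step manipulation that semantic lookup alone does not provide.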

The corpus also gestures at what 'real latent computation' might require, which is the interesting twist for a curious reader. Latent-thought language models add a separate, slower-learning vector of 'thought' that scales independently of parameters (Can latent thought vectors scale language models beyond parameters?), and neural-memory architectures like Titans carve out a distinct module for storing and updating information over time instead of folding everything into attention (Can neural memory modules scale language models beyond attention limits?). These hint that iterative computation may need dedicated structure, a place to hold and revise intermediate state, rather than emerging for free from a bigger next-token predictor. There's even evidence that models do spontaneously build structured internal geometry, with syntactic relations encoded in polar coordinates (How do language models encode syntactic relations geometrically?), so the latent space is not formless. It just doesn't, on its own, host the kind of loop an iterative numerical method needs.

The thing you didn't know you wanted to know: the failure isn't that LLMs can't do the arithmetic; it's that they don't realize they should be doing arithmetic at all. They see a problem that looks familiar and answer from resemblance, which is exactly why scaling, the usual fix, leaves the ceiling untouched.

Sources (8 notes)

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.


Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (3.38 match)
Can Large Language Models Reason and Optimize Under Constraints? (1.72 match)
Long-context LLMs Struggle with Long In-context Learning (1.72 match)
Large Language Model Reasoning Failures (1.71 match)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation (1.70 match)
𝙻𝙼𝟸: A Simple Society of Language Models Solves Complex Reasoning (1.68 match)
Bigger is not always better: The importance of human-scale language modeling for psycholinguistics (1.67 match)
Scalable Language Models with Posterior Inference of Latent Thought Vectors (0.93 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The precise question: Can language models execute iterative numerical methods in latent space—or do they only pattern-match their outputs?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable, not current ground truth.

• LLMs fail to iterate toward solutions; they emit plausible-looking but incorrect values by recognizing problem templates rather than executing procedures (~2024–2026).
• Models plateau at ~55–60% constraint satisfaction on genuine optimization tasks regardless of scale or architecture; this ceiling persists across 'reasoning' models (~2024–2025).
• Failure is predictable from autoregressive statistics: tasks requiring procedure-following (counting, reversing, iterative refinement) are logically trivial but statistically rare in training data (~2024).
• Long-context LLMs collapse on relational queries requiring multi-step joins across structured tables, despite matching retrieval on semantic lookup (~2024).
• Latent-thought models (separate, slower-learning thought vectors) and neural-memory modules (dedicated update loops) scale differently and hint that real iteration may require dedicated structure, not free emergence (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.14905 (Feb 2024) — MobileLLM: Sub-billion parameter constraints and capability ceilings.
• arXiv:2406.13121 (Jun 2024) — Long-Context LLMs Subsume Retrieval; failure on relational queries.
• arXiv:2501.00663 (Dec 2024) — Titans: Neural memory modules with adaptive update.
• arXiv:2502.01567 (Feb 2025) — Latent Thought Vectors: Separate scaling dimensions.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether recent advances in chain-of-thought orchestration (multi-step prompting, memory-augmented inference, tool-use harnesses), new training methods (process supervision, synthetic iteration data), or architectural retrofits (pluggable compute modules, external solver integration) have relaxed or overturned the ~55–60% optimization ceiling. Distinguish the durable question (whether end-to-end latent iteration emerges without scaffolding) from the perishable claim (that no architecture can execute it). Cite what—if anything—has moved the needle.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. If recent papers report genuine iterative behavior in latent space, or show optimization ceilings have risen, name them and explain the mechanism.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., *If* external solver APIs become the standard bridge, *does* latent-space iteration become a non-question? *Or* does embedding iteration in training data (synthetic rollouts, process labels) now enable it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
