Do large language models actually perform iterative optimization?
Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.
The constraint-optimization study identifies the mechanism behind the 55-60% constraint-satisfaction plateau directly: LLMs cannot actually perform Newton-Raphson iterations in their latent space, nor primal-dual updates, nor any other iterative numerical procedure that genuine optimization requires. When asked to do so, they fall back to what the paper calls "result guessing": recognizing the problem as similar to a standard power grid (or financial dataset, or security scenario) and emitting values that match what a valid solution should look like.
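For reference, this is roughly what an iterative procedure of the Newton-Raphson kind looks like when it is actually executed. The sketch below is a generic one-dimensional textbook version, not the study's setup; the function name `newton_raphson`, the tolerance, and the example equation x^2 - 2 = 0 are illustrative assumptions. The point is that each step depends on the exact numerical result of the previous one, which is what result guessing skips.

```python
from typing import Callable, List, Tuple


def newton_raphson(
    f: Callable[[float], float],
    df: Callable[[float], float],
    x0: float,
    tol: float = 1e-10,
    max_iter: int = 50,
) -> Tuple[float, List[float]]:
    """Find a root of f by Newton-Raphson; return the root and all iterates.

    Illustrative sketch only: names and defaults are assumptions.
    """
    x = x0
    history = [x]
    for _ in range(max_iter):
        fx = f(x)
        # Stop once the residual is small enough.
        if abs(fx) < tol:
            return x, history
        dfx = df(x)
        if dfx == 0.0:
            raise ZeroDivisionError("derivative vanished; Newton step undefined")
        # Each update uses the exact value produced by the previous step.
        x = x - fx / dfx
        history.append(x)
    raise RuntimeError("Newton-Raphson did not converge within max_iter")


# Hypothetical example: solve x^2 - 2 = 0 starting from x0 = 1.0.
root, history = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Inspecting `history` shows the intermediate iterates. A guessed answer has no such trajectory behind it, only a final value of the expected shape.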
The fallback is silent. The output is fluent, well-formatted, often plausible. It can pass surface-level inspection because the model has seen many examples of what answers in this domain look like. What it has not done is solve the problem. The constraint values are wrong in ways that physical or financial systems would actually reject.
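Because the fallback is silent, one way to catch it is to check emitted values against the constraints directly rather than judging them by appearance. The following is a minimal sketch under assumed conditions: a toy dispatch problem with a single power-balance equality and per-generator limits. The function `check_dispatch` and all numbers are hypothetical and do not come from the study.

```python
from typing import List, Tuple


def check_dispatch(
    generation: List[float],
    demand: float,
    limits: List[Tuple[float, float]],
    tol: float = 1e-6,
) -> List[str]:
    """Return human-readable constraint violations for a candidate dispatch.

    Toy model (assumption): total generation must equal demand, and each
    generator must stay within its (min, max) limits.
    """
    violations = []
    # Equality constraint: supply must match demand.
    imbalance = sum(generation) - demand
    if abs(imbalance) > tol:
        violations.append(f"power balance off by {imbalance:+.3f}")
    # Inequality constraints: per-generator operating limits.
    for i, (g, (lo, hi)) in enumerate(zip(generation, limits)):
        if g < lo - tol or g > hi + tol:
            violations.append(f"generator {i} output {g} outside [{lo}, {hi}]")
    return violations


limits = [(0.0, 60.0), (0.0, 60.0)]
# A feasible candidate: balanced and within limits.
feasible = check_dispatch([50.0, 50.0], demand=100.0, limits=limits)
# A plausible-looking but infeasible candidate: wrong total, one limit exceeded.
guessed = check_dispatch([70.0, 40.0], demand=100.0, limits=limits)
```

Here `feasible` is an empty list, while `guessed` reports both the imbalance and the limit violation, even though both candidates look like well-formed answers.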
This explains why scale, architecture, and training regime do not move the plateau. They improve the template but not the procedure. A larger model has seen more example solutions and can produce more convincing guesses. Reinforcement learning on outcome rewards reinforces the template-matching pattern. None of this installs the iterative-computation capability the problem requires.
The mechanism — pattern-match against memorized solution-shapes when genuine computation is required — generalizes beyond optimization. It is plausibly the same mechanism behind a class of mathematical-reasoning failures where models produce confidently wrong numerical answers that resemble the right shape. The category is "looks like a solution; is not derived from one."
Inquiring lines that use this note as a source (113)
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Can closed-form solutions compete with gradient descent optimization?
- What distinguishes minimal-pair asymmetry from standard accuracy evaluation?
- How do unstated constraints become invisible to training data distributions?
- Can benchmarks designed for shortcut learning detect heuristic override failures?
- How do cost-efficient LLM models compare to high-performance ones in recommendation?
- Why do sigmoid conflict curves look the same across different language models?
- Can explicit constraint statements override the dominance of surface heuristics?
- How should benchmarks test whether models fit algorithms or patterns?
- Can universal function approximators be expensive to learn in practice?
- Why do simple length heuristics outperform sophisticated semantic methods?
- Does the Heuristic Override Benchmark measure enumeration or world knowledge?
- Why do LLM outputs match researcher priors without solving tasks correctly?
- How does intersubjective validation differ from pattern recognition in training data?
- Why do token-level language models fail at utterance-level pragmatic optimization?
- Does scaling model size solve compositional generalization problems?
- Would hybrid systems combining LLMs with symbolic solvers overcome the retraction limitation?
- Why do language models fall back on frequency heuristics under structural complexity?
- How does the discrete token bottleneck prevent gradient flow in language model control?
- Why does most refinement in iterative models maintain answers rather than improve them?
- How does latent space diffusion enable evolutionary search in high dimensions?
- Can prompt optimization alone inject knowledge models don't already have?
- Can latent recurrence and energy minimization both escape the same computational depth constraints?
- Do language models build world models or just task-specific heuristics?
- How does fitness-proportional selection guide LLM recombination in unstructured solution spaces?
- What decomposition level minimizes both error rate and computational cost in practice?
- Can prompt optimization inject new knowledge into language models?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- Why do task-specific heuristics fail at generalizing to sparse data regions?
- How do LLMs compress specific expert knowledge into median abstraction?
- Why do large language models still have systematic blind spots with complex structures?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- How can gradients flow through discrete document selection?
- Can prompt optimization inject genuinely new knowledge into a model?
- How does algorithmic control flow define computational graph structure in LLM programs?
- Do task-specific heuristics emerge because they compress well enough?
- Do latent sequence vectors outperform per-token latent iterative computation for reasoning?
- Do LLMs rely on surface heuristics instead of learning recursive grammar rules?
- Why do rare complex structures in training data harm LLM generalization?
- How do general language model benchmarks predict specialized domain performance?
- Do standard language benchmarks underestimate what LLMs can actually do?
- Why do standard NLP benchmarks hide the most critical language limitations?
- Do instruction-tuned models learn tasks or just output format distributions?
- Why does iterative refinement amplify rather than correct reasoning errors?
- Why does genetic programming outperform direct LLM generation by 86 percent?
- Can LLMs reliably generate novel working architectures without structured representations?
- How do description-based identifiers bias language model output distribution?
- Can textual gradients generalize natural language feedback across computation graphs?
- Can LLMs recover true joint distributions from marginal census data?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- What formal language complexity level matches transformer computational limits best?
- Is gradient behavior in language functional or a sign of ambiguity?
- What knowledge can prompt optimization actually activate in trained models?
- How do language agents become optimizable computational graphs automatically?
- What distinguishes hierarchical dual-recurrence from flat parameter-sharing recurrence?
- Why do large language models outperform fine-tuned models once repeated items are removed?
- How does trajectory filtering handle noise when language models use code execution tools?
- Can optimization algorithms exploit the shift between procedural and planning bottlenecks?
- Why do smaller models favor code formats while larger models prefer natural language?
- Can critique-only calls in LLMs exploit a measurable gap between generation and evaluation?
- Do small models show different parameter efficiency patterns than large models?
- How should tiny language models be architected differently than large ones?
- How many particles and iterations does optimal expert discovery require?
- How does symbolic solver feedback differ from language-based self-critique?
- What filtering criteria best identify student-compatible refinements from teacher models?
- What happens when we treat LLM outputs as sampled rather than stored?
- What distinguishes intrinsic search from extrinsic search method approaches?
- What non-parametric methods could replace latent factors for inductive learning?
- Why do backward-looking benchmarks underestimate LLM scientific value?
- Does sequence prediction accuracy prove an underlying world model exists?
- Why do smaller LLMs fail at zero-shot argument scheme classification?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- What mechanism causes LLMs to plateau on numerical optimization tasks?
- Why do reasoning models fail to improve constrained optimization performance?
- Can LLMs successfully translate natural language into formal solver specifications?
- How should organizations redesign workflows if LLMs cannot solve optimization directly?
- What concrete problems do LLMs solve at the computational level?
- Why do language models plateau at 55 to 60 percent constraint satisfaction?
- Why do LLMs fail at directly solving stochastic control problems?
- What latent mechanisms do LLMs use when they cannot execute iterative methods?
- Why do language models fail at iterative numerical optimization despite scale?
- How do out-of-distribution tests reveal that optimization learning is memorization?
- What makes natural-language APIs particularly suited to LLM-based simulation?
- How do deterministic symbolic solvers improve the reliability of language model reasoning?
- Why does AI code generation lag behind pattern-matching benchmarks?
- Can surface-level correctness hide failures in structural learning by LLMs?
- How should skill libraries coordinate with gradient-based weight optimization?
- Why does teacher forcing fail to capture long-range dependencies?
- Can tool use or self-conditioning fix degradation in extended LLM workflows?
- Do pretrained language models carry reusable computational scaffolding for length handling?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- Why does iterative refinement fail when information stays constant?
- Do independent LLM outputs converge enough to create artificial hiveminds?
- How can we probe LLM representations in channels that training did not target?
- Why do language models plateau at constraint satisfaction regardless of scale?
- Can language models execute iterative numerical methods in latent space?
- Can width-scaling replace depth-scaling on inherently sequential problems?
- Can smaller LLMs perform tool use tasks through modular decomposition?
- What prevents monolithic LLMs from coordinating decomposition with execution?
- Can categorical correctness signals stop dense optimizers from finding loopholes?
- Can trained models encode programs more complex than their data-generating process?
- Why does prompt optimization alone fail to inject genuinely new knowledge?
- Why do LLMs fail at iterative numerical computation in latent space?
- What constraint satisfaction rate do LLMs achieve at scale?
- Why do LLMs struggle more when only numerical values change?
- Are newer larger language models actually worse at faithful summarization?
- Can LLMs simultaneously reason and optimize their own modules?
- How do LLM activations sparsify differently under out-of-distribution inputs?
- Why is latent-level prediction more sample-efficient than token-level prediction?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- How do normalization and input injection control emergence of fixed points?
- What power-law scaling patterns emerge when consistency models are trained at scale?
- Can instruction prompts reliably steer an LLM judge toward specific alignment targets?
Related concepts in this collection (5)
- Do larger language models solve constrained optimization better?
  Explores whether scaling LLMs (through more parameters, better training, or reasoning extensions) improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.
  same paper, the plateau this mechanism explains
- Do reasoning models actually beat standard models on optimization?
  Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.
  same paper, why extended CoT does not fix it
- Do fine-tuned language models actually learn optimization procedures?
  Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.
  same paper, the diagnostic that exposes the memorization
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  adjacent: imitation vs computation
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  adjacent: form-over-content failure mode
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
- Can Large Language Models Reason and Optimize Under Constraints?
- Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models
- Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors
- Chain of Thoughtlessness? An Analysis of CoT in Planning
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- A Mechanistic Analysis of Looped Reasoning Language Models
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Original note title
LLMs cannot execute iterative numerical methods in latent space and fall back to result guessing against memorized templates