INQUIRING LINE


LLMs do one thing reliably: convert vague human language into formal structure — the computation that follows is where they break.

What concrete problems do LLMs solve at the computational level?

This explores what LLMs are actually good at when you describe them at the 'computational level' — Marr's term for *what problem a system solves*, separate from how it does it — and the corpus has a surprisingly sharp answer: the thing they reliably solve is translation from messy language into formal structure, not the structured computation that follows.

Read this way, the question is not which tasks an LLM can be prompted to attempt but which problem the machine actually solves. An LLM is best understood as an autoregressive probability machine [Can we predict where language models will fail?], and the problem it genuinely solves is reading underspecified natural-language input and emitting a formal structure: a symbolic representation, a solver program, or a logical encoding. What it does *not* reliably solve is the deterministic computation that comes after.

The cleanest statement of this comes from optimization research. LLMs plateau around 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime [Do larger language models solve constrained optimization better?], because they do not run iterative numerical methods internally: they match a problem to a memorized template and emit plausible-looking but wrong numbers [Do large language models actually perform iterative optimization?]. The productive design, then, is to restrict the model to abstraction only: let it translate the word problem into solver code, and hand the numeric work to a deterministic solver [Should LLMs handle abstraction only in optimization?]. The same division of labor shows up in logic, where the LLM formalizes the problem and a symbolic engine executes the inference and feeds back machine-verifiable errors, which catches translation mistakes far better than having the LLM critique itself [Can symbolic solvers fix how LLMs reason about logic?].
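The abstraction-only split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the design from any cited paper: the `LinearProgram` schema, the `translate` stub (which stands in for the model call and returns a fixed encoding), and the brute-force `solve` routine are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LinearProgram:
    """Formal structure the model is asked to emit (hypothetical schema)."""
    objective: tuple[int, ...]  # coefficients to maximize
    constraints: tuple[tuple[tuple[int, ...], int], ...]  # (coeffs, bound): coeffs . x <= bound
    upper: int  # every variable is an integer in 0..upper


def translate(word_problem: str) -> LinearProgram:
    """Stand-in for the LLM call: language in, formal structure out.

    A real system would prompt a model here. This stub ignores its input and
    returns a fixed encoding so the sketch runs offline. The encoding matches:
    "Chairs earn 2 and tables earn 5. A chair takes 1 hour and a table 2,
    with 8 hours available. At most 3 tables fit in the van."
    """
    return LinearProgram(
        objective=(2, 5),
        constraints=(((1, 2), 8), ((0, 1), 3)),
        upper=8,
    )


def solve(lp: LinearProgram) -> tuple[tuple[int, ...], int]:
    """Deterministic execution: exhaustive search over the integer grid."""
    best_x, best_value = None, None
    for x in product(range(lp.upper + 1), repeat=len(lp.objective)):
        feasible = all(
            sum(c * v for c, v in zip(coeffs, x)) <= bound
            for coeffs, bound in lp.constraints
        )
        if not feasible:
            continue
        value = sum(c * v for c, v in zip(lp.objective, x))
        if best_value is None or value > best_value:
            best_x, best_value = x, value
    if best_x is None:
        raise ValueError("no feasible point")
    return best_x, best_value


if __name__ == "__main__":
    plan, profit = solve(translate("Chairs earn 2 and tables earn 5 ..."))
    print(plan, profit)
```

The point of the split is that every number in the answer comes from `solve`, which either returns a feasible optimum or raises. The model's only job is to produce the `LinearProgram`, so a wrong answer can only come from a wrong encoding, which can be inspected directly.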

Why draw the line exactly there? Because the failures cluster on the execution side, not the translation side. Reasoning models wander instead of searching systematically, so success drops exponentially as problems get deeper [Why do reasoning LLMs fail at deeper problem solving?]. Models can state a correct principle and then fail to apply it, a 'split-brain' in which explanation runs at 87% accuracy and execution at 64% [Can language models understand without actually executing correctly?]; the pattern is sharp enough to have its own name, Potemkin understanding [Can LLMs understand concepts they cannot apply?]. Even RL fine-tuning does not install genuine procedures; it sharpens the template-matching, which collapses on out-of-distribution variants [Do fine-tuned language models actually learn optimization procedures?].
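To see why an exponential drop with depth is so punishing, consider a deliberately simplified model: assume each reasoning step succeeds independently with the same probability, so a problem of a given depth succeeds only if every step does. That independence assumption and the 0.95 figure are illustrative choices made here, not values from the cited work.

```python
def success_probability(per_step: float, depth: int) -> float:
    """Chance that all `depth` steps succeed, assuming independent steps."""
    return per_step ** depth


def max_reliable_depth(per_step: float, target: float) -> int:
    """Largest depth whose overall success probability stays at or above `target`."""
    if not (0.0 < per_step < 1.0 and 0.0 < target <= 1.0):
        raise ValueError("need 0 < per_step < 1 and 0 < target <= 1")
    depth, prob = 0, 1.0
    while prob * per_step >= target:
        prob *= per_step
        depth += 1
    return depth


if __name__ == "__main__":
    for depth in (1, 5, 10, 20, 40):
        print(depth, round(success_probability(0.95, depth), 3))
    print("deepest problem solved at least half the time:",
          max_reliable_depth(0.95, 0.5))
```

Under these assumptions a model that is right 95% of the time per step still falls below even odds after roughly a dozen steps, which is one way to read "medium problems solvable, deep problems catastrophically harder".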

The interesting twist: the way to *get more* reliable computation out of an LLM is to ask it to do less of it. Wrap the model inside an explicit algorithm that hands each call only the slice of context it needs, and complex reasoning becomes a set of small, debuggable sub-tasks [Can algorithms control LLM reasoning better than LLMs alone?]. Even a known weakness, the failure to surface unstated preconditions (the old 'frame problem'), yields to this: prompting that forces the model to enumerate background conditions explicitly raises accuracy from 30% to 85% [Do language models fail at identifying unstated preconditions?]. The throughline across all of it: at the computational level, LLMs solve *abstraction and translation*. The competence you want from them is not computing the answer; it is converting an ambiguous human description into a structured form that something else can compute.
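A rough sketch of the "explicit algorithm around the model" idea follows. Ordinary code owns the loop and the state, and each model call receives only the context for its step. Everything named here is hypothetical: `answer_over_documents` is an illustrative controller, and `RecordingStub` is an offline stand-in for a model that also records the prompts it receives, so the information hiding can be checked.

```python
from typing import Callable

Model = Callable[[str], str]


def answer_over_documents(question: str, documents: list[str], model: Model) -> str:
    """Controller owns control flow and state; the model sees one slice per call."""
    notes: list[str] = []
    for doc in documents:
        # Step-specific context: the question plus a single document.
        prompt = f"Question: {question}\nDocument: {doc}\nRelevant fact or NONE:"
        reply = model(prompt).strip()
        if reply != "NONE":
            notes.append(reply)
    # The final call sees only the distilled notes, never the raw documents.
    joined = "\n".join(notes)
    return model(f"Question: {question}\nNotes:\n{joined}\nAnswer:").strip()


class RecordingStub:
    """Offline stand-in for a model (assumption); keeps every prompt it was given."""

    def __init__(self) -> None:
        self.prompts: list[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if "Document:" in prompt:
            # Extraction step: keep a document only if it mentions "solver".
            doc = prompt.split("Document: ", 1)[1].split("\n", 1)[0]
            return doc if "solver" in doc else "NONE"
        # Final step: echo the notes back as the answer.
        return prompt.split("Notes:\n", 1)[1].rsplit("\nAnswer:", 1)[0]


if __name__ == "__main__":
    stub = RecordingStub()
    docs = ["The solver runs simplex.", "Lunch is at noon."]
    print(answer_over_documents("What does the solver run?", docs, stub))
    print(len(stub.prompts), "model calls")
```

Because the controller is plain code, a failure can be traced to one recorded prompt and one reply, which is what makes the sub-tasks debuggable.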

Sources (11 notes)

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Should LLMs handle abstraction only in optimization?

LLMs plateau at constraint satisfaction regardless of scale, but excel at natural-language-to-formal-structure translation. The productive architecture restricts LLMs to reading input and emitting solver code, leaving numeric iteration to deterministic solvers.

Can symbolic solvers fix how LLMs reason about logic?

Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.


Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.
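The formalize, execute, and repair loop summarized in the Logic-LM note above can be illustrated with a toy version. This is a sketch under loud assumptions, not Logic-LM's implementation: the rule syntax (`a, b -> c`), the forward-chaining engine, and `stub_formalize` (a stand-in for the model whose first draft contains a syntax slip) are all invented here for illustration.

```python
from typing import Callable, Optional

Formalizer = Callable[[str, Optional[str]], str]


def parse_rules(program: str) -> tuple[set[str], list[tuple[frozenset[str], str]]]:
    """Parse lines of the form 'fact' or 'a, b -> c'; raise ValueError on bad syntax."""
    facts: set[str] = set()
    rules: list[tuple[frozenset[str], str]] = []
    for number, raw in enumerate(program.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if "->" in line:
            body, head = (part.strip() for part in line.split("->", 1))
            premises = frozenset(p.strip() for p in body.split(",") if p.strip())
            valid = premises and head.isidentifier() and all(p.isidentifier() for p in premises)
            if not valid:
                raise ValueError(f"line {number}: malformed rule {line!r}")
            rules.append((premises, head))
        elif line.isidentifier():
            facts.add(line)
        else:
            raise ValueError(f"line {number}: malformed fact {line!r}")
    return facts, rules


def entails(program: str, goal: str) -> bool:
    """Deterministic inference: forward chaining to a fixed point."""
    facts, rules = parse_rules(program)
    changed = True
    while changed:
        changed = False
        for premises, head in rules:
            if premises <= facts and head not in facts:
                facts.add(head)
                changed = True
    return goal in facts


def solve_with_feedback(problem: str, goal: str, formalize: Formalizer,
                        max_rounds: int = 3) -> bool:
    """Ask for a formalization, run the engine, and feed engine errors back."""
    error: Optional[str] = None
    for _ in range(max_rounds):
        program = formalize(problem, error)
        try:
            return entails(program, goal)
        except ValueError as exc:
            error = str(exc)  # machine-verifiable message for the next attempt
    raise RuntimeError(f"no valid formalization after {max_rounds} rounds: {error}")


def stub_formalize(problem: str, error: Optional[str]) -> str:
    """Stand-in for the LLM: the first draft uses '=>' by mistake, the retry fixes it."""
    return "rain\nrain => wet" if error is None else "rain\nrain -> wet"


if __name__ == "__main__":
    print(solve_with_feedback("If it rains the street is wet. It rains.", "wet", stub_formalize))
```

The feedback here is a parser error with a line number rather than the model's opinion of its own output, which is the property the note credits for catching translation errors.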

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (6.00 match, arXiv)
Large Language Model Reasoning Failures (4.35 match, arXiv)
Can Large Language Models Reason and Optimize Under Constraints? (3.39 match, arXiv)
Probing Structured Semantics Understanding and Generation of Language Models via Question Answering (2.56 match, arXiv)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation (2.51 match, arXiv)
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning (1.79 match, arXiv)
Reasoning LLMs are Wandering Solution Explorers (1.75 match, arXiv)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (1.74 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about LLM computational competence. The question: what concrete problems do LLMs genuinely solve at Marr's computational level—and has that answer shifted?

What a curated library found — and when (findings span 2023–2026, dated claims not current truth):
• LLMs solve abstraction & translation (formalizing problems into structured form), not execution: they plateau at 55–60% constraint satisfaction regardless of scale because they pattern-match rather than iterate (2024–2025).
• Reasoning models wander instead of systematically search; success drops exponentially with problem depth (2025).
• Split-brain failure: models explain correctly (87%) but fail execution (64%); RL fine-tuning sharpens template-matching, not genuine procedure (2024–2025).
• Wrapping LLMs in explicit algorithms—each call gets narrow context—decomposes complex tasks; forcing enumeration of unstated preconditions jumps accuracy 30%→85% (2023–2024).
• Symbolic solver integration (logic formalization + deterministic inference) catches translation errors better than self-critique (2023).

Anchor papers (verify; mind their dates):
• Logic-LM (2305.12295, May 2023): symbolic solvers offload execution from LLM.
• Reasoning LLMs are Wandering Solution Explorers (2505.20296, May 2025): wandering vs. systematic search.
• Echo Chamber (2504.07912, April 2025): RL amplifies pretraining behaviors, doesn't install procedures.
• Comprehension Without Competence (2507.10624, July 2025): split-brain as architectural limit.

Your task:
(1) RE-TEST each constraint. For the 55–60% plateau, 87%/64% split, and wandering-search claim: have newer models (o1 reasoning, updated inference oracles, retrieval-augmented solvers, or tool-integrated pipelines) relaxed these limits? Separate durable question (what IS the LLM's computational role?) from perishable limitation (what specific failure mode has been fixed?). Cite what fixed it.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months—any paper claiming LLMs *do* execute iterative methods, or *don't* require external solvers, or *overcome* the abstraction/execution divide.
(3) Propose 2 research questions assuming the regime may have moved: e.g., "Can hybrid neuro-symbolic architectures collapse the abstraction–execution boundary?" or "Do emergent CoT scaling laws change what 'abstraction-only' means?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
