What concrete problems do LLMs solve at the computational level?
This explores what LLMs are actually good at when you describe them at the 'computational level' — Marr's term for *what problem a system solves*, separate from how it does it — and the corpus has a surprisingly sharp answer: the thing they reliably solve is translation from messy language into formal structure, not the structured computation that follows.
This reads the question through Marr's 'computational level' (what problem is the machine actually solving?) rather than asking what tasks LLMs can be prompted to attempt. The corpus converges on a striking answer: an LLM is best understood as an autoregressive probability machine (Can we predict where language models will fail?), and the problem it genuinely solves is reading underspecified natural-language input and emitting a formal structure such as a symbolic representation, a solver program, or a logical encoding. What it does *not* reliably solve is the deterministic computation that comes after.
The cleanest statement of this comes from optimization research: LLMs plateau around 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime (Do larger language models solve constrained optimization better?), because they do not actually run iterative numerical methods internally. Instead they pattern-match a problem to a memorized template and emit plausible-looking but wrong numbers (Do large language models actually perform iterative optimization?). The productive design, then, is to restrict the model to abstraction only: let it translate the word problem into solver code, and hand the numeric grinding to a deterministic solver (Should LLMs handle abstraction only in optimization?). The same division of labor shows up in logic, where the LLM formalizes the problem and a symbolic engine executes the inference and feeds back machine-verifiable errors, which catches translation mistakes better than the LLM critiquing itself (Can symbolic solvers fix how LLMs reason about logic?).
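The division of labor described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited note: the `Spec` schema, `check_spec`, `solve`, and `stub_translate` are hypothetical names, the model call is replaced by a stub that makes one deliberate mistake, and a brute-force search over small integer bounds stands in for a real solver.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Optional


@dataclass
class Spec:
    """Formal structure the model is asked to emit (hypothetical schema)."""
    variables: dict[str, tuple[int, int]]  # name -> inclusive integer bounds
    constraints: list[str]                 # boolean expressions over the variables
    objective: str                         # expression to maximize


def check_spec(spec: Spec) -> list[str]:
    """Deterministic validation that returns machine-verifiable error messages."""
    errors = []
    known = set(spec.variables)
    for expr in spec.constraints + [spec.objective]:
        try:
            code = compile(expr, "<spec>", "eval")
        except SyntaxError as exc:
            errors.append(f"syntax error in {expr!r}: {exc.msg}")
            continue
        unknown = set(code.co_names) - known
        if unknown:
            errors.append(f"undefined names in {expr!r}: {sorted(unknown)}")
    return errors


def solve(spec: Spec) -> Optional[dict[str, int]]:
    """Exhaustive search over the integer box: the deterministic numeric work."""
    names = list(spec.variables)
    ranges = [range(lo, hi + 1) for lo, hi in spec.variables.values()]
    best, best_value = None, None
    for values in product(*ranges):
        env = dict(zip(names, values))
        if all(eval(c, {"__builtins__": {}}, env) for c in spec.constraints):
            value = eval(spec.objective, {"__builtins__": {}}, env)
            if best_value is None or value > best_value:
                best, best_value = env, value
    return best


def translate_and_solve(
    problem: str,
    translate: Callable[[str, list[str]], Spec],
    max_rounds: int = 3,
) -> Optional[dict[str, int]]:
    """The model only formalizes; errors come from the checker, not self-critique."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        spec = translate(problem, feedback)
        feedback = check_spec(spec)
        if not feedback:
            return solve(spec)
    raise ValueError(f"no valid formalization after {max_rounds} rounds: {feedback}")


def stub_translate(problem: str, feedback: list[str]) -> Spec:
    """Stand-in for an LLM call: the first attempt references an undefined name."""
    second = "x + 3*z <= 6" if not feedback else "x + 3*y <= 6"
    return Spec({"x": (0, 10), "y": (0, 10)}, ["x + y <= 4", second], "3*x + 2*y")


result = translate_and_solve("Maximize 3x + 2y subject to two limits.", stub_translate)
print(result)
```

The point of the design is that the stub's first mistake is caught by a deterministic check and returned as a concrete error message, so the retry is driven by the checker rather than by the model's own judgment of its output.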
Why draw the line exactly there? Because the failures cluster on the execution side, not the translation side. Reasoning models wander instead of searching systematically, so success probability drops exponentially as problems get deeper (Why do reasoning LLMs fail at deeper problem solving?). Models can state a correct principle and then fail to apply it: a 'split-brain' in which explanation runs at 87% accuracy and execution at 64% (Can language models understand without actually executing correctly?). The pattern is sharp enough to have its own name, Potemkin understanding (Can LLMs understand concepts they cannot apply?). Even RL fine-tuning does not install genuine procedures; it sharpens the template-matching, which collapses on out-of-distribution variants (Do fine-tuned language models actually learn optimization procedures?).
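A back-of-the-envelope model shows why depth is so punishing. It rests on a simplifying assumption that is not taken from the cited notes: every step succeeds independently with the same probability. Under that assumption, the chance of getting a whole chain right is the per-step accuracy raised to the depth. The function name `chain_success` is illustrative.

```python
def chain_success(step_accuracy: float, depth: int) -> float:
    """Probability that all `depth` steps succeed, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    return step_accuracy ** depth


# Even a high per-step accuracy decays quickly as the chain gets longer.
for depth in (1, 5, 10, 20, 40):
    print(depth, round(chain_success(0.95, depth), 3))
```

Real reasoning traces are not independent trials, so this is only a rough intuition for the exponential shape, not a prediction of any measured result.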
The interesting twist for a curious reader: the way to *get more* reliable computation out of an LLM is to ask it to do less of it. Wrap the model inside an explicit algorithm that hands each call only the slice of context it needs, and complex reasoning becomes a set of small, debuggable sub-tasks (Can algorithms control LLM reasoning better than LLMs alone?). Even a known weakness yields to this: models often fail to surface unstated preconditions (the old 'frame problem'), yet forcing them to enumerate background conditions explicitly raises accuracy from 30% to 85% (Do language models fail at identifying unstated preconditions?). The throughline across all of it: at the computational level, LLMs solve *abstraction and translation*. The competence you want from them is not computing the answer; it is converting an ambiguous human description into a structured form that something else can compute.
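The idea of wrapping the model in an explicit algorithm can be sketched as follows. This is an illustration under assumptions rather than the design of any cited system: the `LLM` type is a hypothetical prompt-in, text-out interface, the prompts are invented for the example, and a recording stub replaces the model so the control flow can be inspected.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def answer_with_program(question: str, documents: list[str], llm: LLM) -> str:
    """Explicit control flow; each call sees only the context its step needs."""
    # Step 1: one call per document, so no call sees more than one document.
    notes = []
    for doc in documents:
        note = llm(
            f"Question: {question}\nDocument: {doc}\n"
            "Extract the relevant fact, or reply NONE."
        ).strip()
        if note != "NONE":
            notes.append(note)

    # Step 2: force explicit enumeration of preconditions (question only).
    preconditions = llm(
        f"Question: {question}\nList the unstated background conditions that must hold."
    )

    # Step 3: the final call sees extracted notes, never the raw documents.
    return llm(
        f"Question: {question}\nFacts: {notes}\n"
        f"Preconditions: {preconditions}\nAnswer:"
    )


calls: list[str] = []


def stub_llm(prompt: str) -> str:
    """Recording stand-in for a model, used only to make the sketch runnable."""
    calls.append(prompt)
    if "Extract" in prompt:
        return "NONE" if "weather" in prompt else "fact: the bridge limit is 10 tons"
    if "background conditions" in prompt:
        return "the truck must be on the bridge alone"
    return "final answer"


result = answer_with_program(
    "Can a 12-ton truck cross the bridge?",
    ["The weather was sunny.", "Bridge limit: 10 tons."],
    stub_llm,
)
print(len(calls), result)
```

Because every step is an ordinary function call with a recorded prompt, a wrong answer can be traced to the specific sub-task that produced it, which is the debuggability the paragraph above describes.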
Sources (11 notes)
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
LLMs plateau at constraint satisfaction regardless of scale, but excel at natural-language-to-formal-structure translation. The productive architecture restricts LLMs to reading input and emitting solver code, leaving numeric iteration to deterministic solvers.
Logic-LM divides cognitive labor by having LLMs formulate symbolic representations while deterministic solvers execute inference and provide machine-verifiable error messages. This structured feedback loop catches translation errors better than LLM self-critique, improving faithful reasoning without requiring perfect formalization.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.