Can we predict where language models will fail?
If we characterize an LLM at the computational level, as a probability machine over sequences, can we predict which tasks it will systematically struggle with before running any experiments?
The "Levels of Analysis for LLMs" argument carries a specific empirical payoff: characterizing the abstract computational problem an LLM solves predicts where it will fail. The "embers of autoregression" line of work (McCoy et al.) is the worked example. By framing LLMs at Marr's computational level — as systems that have learned an autoregressive distribution over text — the researchers could derive in advance that tasks whose target response has low probability under the pretraining distribution would be systematically harder, even when the task itself is logically trivial.
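Stated as a formula, the characterization can be written with the standard chain-rule factorization of an autoregressive model. The notation is an illustrative assumption, not taken from the paper: $x_{1:T}$ is a token sequence, $c$ is the prompt, and $y$ is the target response.

```latex
\[
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\log p_\theta(y \mid c) = \sum_{t=1}^{|y|} \log p_\theta(y_t \mid c, y_{<t})
\]
```

Read this way, the prediction is that, for a fixed task, difficulty should rise as $\log p_\theta(y \mid c)$ falls, regardless of how simple the task is logically.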
The prediction is non-obvious. From a behavioral standpoint, you might expect difficulty to track task complexity. From the computational-level standpoint, you expect difficulty to track target probability, because the system is fundamentally a probability machine over sequences. Tasks like "write the alphabet backwards" or "count uppercase letters" can be logically simple but require generating sequences the pretraining distribution rarely supports. The framework predicted these would be hard before the experiments were run, and they were.
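The contrast between task complexity and target probability can be illustrated with a toy model. The sketch below is not the method used by McCoy et al.: it assumes a smoothed character-bigram model as a stand-in for an LLM and a made-up corpus in which the forward alphabet appears 50 times and the backward alphabet once. The names `train_bigram` and `sequence_log_prob` are hypothetical helpers introduced for illustration.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def train_bigram(corpus, alphabet, alpha=1.0):
    """Fit an add-alpha smoothed character-bigram model (toy stand-in for an LLM)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for text in corpus:
        for prev, nxt in zip(text, text[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    vocab_size = len(alphabet)

    def prob(prev, nxt):
        # P(nxt | prev) with add-alpha smoothing over the alphabet.
        return (pair_counts[(prev, nxt)] + alpha) / (
            prev_counts[prev] + alpha * vocab_size
        )

    return prob


def sequence_log_prob(prob, text):
    """Sum of log P(next char | previous char); the first character is not scored."""
    return sum(math.log(prob(p, n)) for p, n in zip(text, text[1:]))


# Assumed toy "pretraining" corpus: forward alphabet is common, backward is rare.
corpus = [ALPHABET] * 50 + [ALPHABET[::-1]]
prob = train_bigram(corpus, ALPHABET)

forward_lp = sequence_log_prob(prob, ALPHABET)
backward_lp = sequence_log_prob(prob, ALPHABET[::-1])

print(f"forward alphabet  log-prob: {forward_lp:.2f}")
print(f"backward alphabet log-prob: {backward_lp:.2f}")
```

Both targets have the same length and the same logical complexity, yet the backward alphabet receives a much lower log-probability because the corpus rarely contains it. That is the quantity the computational-level view says difficulty should track.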
This is a working example of why a level-of-analysis approach is useful. Without it, the failure modes look like random capability gaps that need to be patched one by one. With it, the gaps look like predictable consequences of a particular kind of system, and they can be enumerated systematically by examining the computational characterization. The researcher who knows what problem the system is actually solving knows where to look for failure.
For interpretability research broadly, this is a template. Find the right computational-level characterization, derive its predictions about where the system should be brittle, and the brittle spots become a research program rather than an exception list.
Inquiring lines that use this note as a source (115)
This note is a source for these synthesized inquiries.
- Why do different language models independently produce similar outputs?
- What should we call errors in LLM outputs when hallucination does not apply?
- Why do sigmoid conflict curves look the same across different language models?
- How much of LLM reasoning failure stems from missing knowledge versus signal weighting?
- How should benchmarks test whether models fit algorithms or patterns?
- What makes a problem instance unfamiliar to a language model?
- Can universal function approximators be expensive to learn in practice?
- Why do LLM outputs match researcher priors without solving tasks correctly?
- Why do NLP benchmarks systematically exclude ambiguous test cases from evaluation?
- Do language models learn surface patterns instead of underlying linguistic principles?
- Why do NLP benchmarks exclude ambiguous instances from evaluation?
- Why do language models produce plausible outputs over accurate failure reports?
- Can implicit linguistic information ever be reliably learned from training data?
- Why do token-level language models fail at utterance-level pragmatic optimization?
- How much of the combinatorial task space must training data cover?
- Why do language models fail at planning despite understanding strategies?
- Why do language models fail when semantic content is stripped away?
- Do token probability distributions in LLMs track human reaction time patterns?
- Why do language models fall back on frequency heuristics under structural complexity?
- Can simple diagnostic tests predict language model performance in production complexity?
- Why do LLM agents fail where game-theoretic bots succeed?
- Do language models learn surface patterns that appear generalizable but actually fail under shift?
- How do rare linguistic registers differ from conceptually complex examples?
- Does next-token prediction alone produce genuine functional language competence?
- Why do large language models fail at temporal reasoning in complex legal cases?
- How do probabilistic dialogue systems handle ASR errors differently?
- Why do power-law distributions make standard ML infrastructure assumptions fail?
- Why does homework adherence remain low despite advances in language model capability?
- Do language models build world models or just task-specific heuristics?
- Can language models acquire meaning from distributional patterns alone without joint attention?
- Why do models fail on logically equivalent tasks with different data distributions?
- Why do task-specific heuristics fail at generalizing to sparse data regions?
- Why do generative and discriminative language model procedures disagree?
- Why do language models naturally under-abstain instead of over-abstain?
- Why do different language models independently converge toward similar outputs in open-ended generation?
- Is paraphrase invariance a reliable assumption when deploying language models in production?
- Why do large language models still have systematic blind spots with complex structures?
- Why do language models fail at grounding and inference?
- What reveals the epistemic limits of language models?
- Can we predict when a specific prompt will fail on a given question?
- Can instance seeds work for tasks beyond language understanding benchmarks?
- Do LLMs rely on surface heuristics instead of learning recursive grammar rules?
- Can complexity-stratified testing reveal whether LLMs understand grammatical structure?
- Why do rare complex structures in training data harm LLM generalization?
- Why do LLMs fail at semantic generalization despite grammatical accuracy?
- How do general language model benchmarks predict specialized domain performance?
- Do language models actually learn linguistic structure or just surface statistics?
- What internal mechanisms explain LLM reasoning and representation limits?
- What structural properties of language models make fabrication inevitable?
- Why do LLMs inherit causal biases from their training data?
- Why do standard NLP benchmarks hide the most critical language limitations?
- Do LLMs fail exploration because of context integration or computational limitations?
- Do language models encode deep syntactic structure or only surface-level patterns?
- How does structural depth in sentences predict LLM annotation accuracy?
- Can auditing LLM performance on complex inputs improve NLP pipeline reliability?
- Why do surface generalizations fail on unusual syntactic structures?
- What makes LLM-guided pruning necessary for MCTS in language rather than game domains?
- How do description-based identifiers bias language model output distribution?
- Does the prediction unit shape what language models actually learn?
- Why do NLP models fail at recognizing multiple valid interpretations?
- Why do different LLMs converge on nearly identical outputs?
- Can prompt position alone shift language model predictions by twenty percent?
- Why does training data not function as a searchable corpus?
- Why do older datasets show higher LLM performance than newer ones?
- What happens when we treat LLM outputs as sampled rather than stored?
- Does directional knowledge failure indicate shallow pattern matching over deep representation?
- Does sequence prediction accuracy prove an underlying world model exists?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Can lightweight linguistic features reliably detect LLM generated arguments?
- How does training distribution shape what language models understand best?
- Why do smaller LLMs fail at zero-shot argument scheme classification?
- What concrete problems do LLMs solve at the computational level?
- Why do language models plateau at 55 to 60 percent constraint satisfaction?
- Why do LLMs fail at directly solving stochastic control problems?
- What latent mechanisms do LLMs use when they cannot execute iterative methods?
- Why do language models fail at iterative numerical optimization despite scale?
- What makes natural-language APIs particularly suited to LLM-based simulation?
- Why do LLM descriptions of argument schemes work better than formal definitions for classification?
- How does modeling capability relate to lossless compression in language models?
- How does the generation-verification gap prevent language models from improving themselves?
- Can surface-level correctness hide failures in structural learning by LLMs?
- How do pretrained language models represent inferential patterns versus lexical and positional cues?
- Why do language models fail at understanding ambiguous or complex requirements?
- Can we predict out-of-distribution generalization without access to downstream tasks?
- Do newer LLM generations create worse detector bias through increased linguistic divergence?
- How do corpus statistics shape the abstraction hierarchy in language model representations?
- Why do unit-sphere spaces fail at distinguishing word order and negation?
- Why should deep learning theory prioritize average-case over worst-case analysis?
- Do pretrained language models carry reusable computational scaffolding for length handling?
- Do newer language models diverge further from human lexical patterns?
- Do larger language models overcome greediness in sequential decision-making?
- Do independent LLM outputs converge enough to create artificial hiveminds?
- How do training data distributions constrain what language models can accurately know?
- Why do long-context language models struggle with compositional reasoning tasks?
- Why do language models plateau at constraint satisfaction regardless of scale?
- Can language models execute iterative numerical methods in latent space?
- Can models distinguish between logical impossibility and their own execution limits?
- Can trained models encode programs more complex than their data-generating process?
- Why does representation sparsity reliably indicate task difficulty for language models?
- How does the pretraining distribution shape what LLMs find hard?
- Can we systematically enumerate LLM failure modes from first principles?
- Why do LLMs fail at iterative numerical computation in latent space?
- Can irrelevant information reliably expose the limits of LLM reasoning?
- Does pseudo-labeling from LLMs degrade classifier performance?
- Can LLMs reliably audit other language models for errors?
- How do mechanistic features compare to natural language for interpretability?
- Why do naive pruning and quantization destroy LLM performance so easily?
- Can language models beat human experts in domains with sparse historical signals?
- Why do more capable language models benefit more from diversity elicitation?
- Where do neural networks still fail at compositional generalization despite scaling?
- What does next-token prediction tell us about compositional linguistic competence?
- Why do multimodal models fail on rare and underrepresented concepts?
- Can instruction prompts reliably steer an LLM judge toward specific alignment targets?
- What empirical evidence supports the Learning Law on real language models?
- What capability boundary exists in LLM prediction of effect sizes?
Related concepts in this collection (3)
- Can cognitive science methods unlock how LLMs actually work?
  Does Marr's three-level framework, developed to understand biological minds, offer interpretability researchers the structured methodology they need to decode opaque language models?
  Relation: same paper, the framework this instantiates
- Can indirect psychology tests reveal what LLMs conceal about bias?
  Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?
  Relation: same paper, the algorithmic-level companion
- Does chain-of-thought reasoning actually generalize beyond training data?
  Explores whether CoT's strong performance on benchmarks reflects genuine reasoning ability or merely learned patterns tied to specific distributions. Tests how CoT behaves when tasks, formats, or reasoning length shift away from training data.
  Relation: adjacent, another distribution-bounded failure mode
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
- Large Language Diffusion Models
- Large Language Model Reasoning Failures
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Long-context LLMs Struggle with Long In-context Learning
- Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics
- Large Language Models Do Not Simulate Human Psychology
Original note title
the computational level predicts where LLMs fail — embers of autoregression anticipated low-probability target failures