SYNTHESIS NOTE


Can we predict where language models will fail?

Does characterizing an LLM at the computational level, as a probability machine over sequences, let us predict which tasks it will systematically struggle with before running any experiments?

Synthesis note · 2026-05-18 · sourced from Philosophy Subjectivity

The "Levels of Analysis for LLMs" argument carries a specific empirical payoff: characterizing the abstract computational problem an LLM solves predicts where it will fail. The "embers of autoregression" line of work (McCoy et al.) is the worked example. By framing LLMs at Marr's computational level, as systems that have learned an autoregressive distribution over text, the researchers could derive in advance that tasks whose target response has low probability under the pretraining distribution would be systematically harder, even when the task itself is logically trivial.
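The quantity this claim turns on can be written in standard notation. This is a sketch, not the formulation used in the cited work; the symbols y, T and p_theta are introduced here for illustration. An autoregressive model scores a target text y of T tokens one token at a time:

```latex
\log p_\theta(y) = \sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t}\right)
```

On this reading, the prediction is that, holding the task fixed, accuracy should fall as \log p_\theta(y) falls, where p_\theta is some language model standing in for the pretraining distribution.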

The prediction is non-obvious. From a behavioral standpoint, you might expect difficulty to track task complexity. From the computational-level standpoint, you expect difficulty to track target probability, because the system is fundamentally a probability machine over sequences. Tasks like "write the alphabet backwards" or "count uppercase letters" are logically simple, but they require generating sequences to which the pretraining distribution assigns low probability. The framework predicted these would be hard before the experiments were run, and they were.
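A toy illustration of "difficulty tracks target probability", not a reproduction of the cited experiments: the sketch below fits a character-bigram model to a hypothetical corpus in which the forward alphabet is common and the reversed alphabet never appears, then scores both strings. The corpus, the bigram model, and the names `train_bigram` and `sequence_log_prob` are assumptions made for this example. A real study would score targets with an actual language model.

```python
import math
from collections import Counter

def train_bigram(corpus, alphabet):
    """Count character bigrams and return a smoothed P(next | prev) function."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])
    vocab_size = len(alphabet)

    def prob(prev, nxt):
        # Add-one (Laplace) smoothing keeps unseen transitions above zero.
        return (pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + vocab_size)

    return prob

def sequence_log_prob(prob, seq):
    """Sum log P(x_t | x_{t-1}) over seq, conditioning on its first character."""
    return sum(math.log(prob(a, b)) for a, b in zip(seq, seq[1:]))

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Hypothetical stand-in for pretraining text: forward alphabet only.
corpus = (ALPHABET + " ") * 50
prob = train_bigram(corpus, ALPHABET + " ")

forward_lp = sequence_log_prob(prob, ALPHABET)
backward_lp = sequence_log_prob(prob, ALPHABET[::-1])
print(f"forward:  {forward_lp:.1f}")
print(f"backward: {backward_lp:.1f}")
```

Under these assumptions the forward alphabet scores far higher than its reversal, even though "reverse the string" is trivial to specify. That is the gap between task complexity and target probability described above.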

This illustrates why a level-of-analysis approach is useful. Without it, the failure modes look like random capability gaps that need to be patched one by one. With it, the gaps look like predictable consequences of a particular kind of system, and they can be enumerated systematically by examining the computational characterization. The researcher who knows what problem the system is actually solving knows where to look for failure.

For interpretability research broadly, this is a template. Find the right computational-level characterization, derive its predictions about where the system should be brittle, and the brittle spots become a research program rather than an exception list.

Inquiring lines that read this note (118)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do language models learn genuine linguistic structure or just surface patterns?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Why do benchmark improvements fail to reflect actual reasoning quality?

How does example difficulty affect learning efficiency in language models?

How do evaluation biases undermine LLM quality assessment systems?

What critical LLM failures do standard benchmarks hide?

Why do language models reinforce false assumptions instead of correcting them?

How do training priors constrain what context information can override?

Can next-token prediction alone produce genuine language understanding?

What determines success in training models on multiple tasks?

How much of the combinatorial task space must training data cover?

Do language models understand semantics or rely on pattern matching?

How do language models inherit human biases from training data?

What coordination failures limit multi-agent LLM systems as they scale?

Why do LLM agents fail where game-theoretic bots succeed?

How should dialogue systems represent uncertainty from noisy speech input?

How do probabilistic dialogue systems handle ASR errors differently?

When does architectural design matter more than raw model capacity?

Why do power-law distributions make standard ML infrastructure assumptions fail?

Do language models develop causal world models or rely on statistical patterns?

Is embodied interaction necessary for language meaning and genuine agency?

Can language models acquire meaning from distributional patterns alone without joint attention?

Why do reasoning models fail at systematic problem-solving and search?

Can prompting strategies overcome LLM biases without model fine-tuning?

Which computational strategies best support reasoning in language models?

What makes LLM-guided pruning necessary for MCTS in language rather than game domains?

Can prompting inject entirely new knowledge into language models?

Can prompt position alone shift language model predictions by twenty percent?

How does memorization interact with learning and generalization?

How does rhetorical adaptation affect LLM persuasion and detectability?

Can lightweight linguistic features reliably detect LLM generated arguments?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

How can LLM user simulators model realistic goal-driven conversation?

What makes natural-language APIs particularly suited to LLM-based simulation?

What role does compression play in language model capability and generalization?

How does modeling capability relate to lossless compression in language models?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Can we predict out-of-distribution generalization without access to downstream tasks?

Why do semantic similarity and task relevance diverge in vector embeddings?

Why do unit-sphere spaces fail at distinguishing word order and negation?

How can identical external performance mask different internal representations?

Why should deep learning theory prioritize average-case over worst-case analysis?

How should models express uncertainty rather than forced confident answers?

Can models distinguish between logical impossibility and their own execution limits?

What are the consequences of models training on synthetic data?

Can trained models encode programs more complex than their data-generating process?

What limits mechanistic interpretability's ability to characterize models?

How do mechanistic features compare to natural language for interpretability?

When does optimizing for quality undermine the value of diversity?

Why do more capable language models benefit more from diversity elicitation?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Where do neural networks still fail at compositional generalization despite scaling?

What articulatory information do speech signals carry that text cannot?

Why do multimodal models fail on rare and underrepresented concepts?

What structural advantages do diffusion language models offer over autoregressive methods?

Why do different LLMs converge on similar outputs in open-ended tasks?

