Why do LLMs struggle with exploration in simple decision tasks?

This note examines why large language models fail at exploration, a core decision-making capability, even when they excel at other tasks, and which specific conditions help them succeed.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Decision-making agents require three core capabilities: generalization (supervised learning), exploration (making suboptimal short-term decisions to gather information), and planning (accounting for long-term consequences). LLMs have been shown to possess generalization and limited planning. Exploration turns out to be the hardest.

In systematic evaluation across multi-armed bandit environments — one of the simplest exploration problems — only a single LLM/prompt configuration achieves satisfactory exploratory behavior: GPT-4 + explicit exploratory hints + external per-arm history summarization + zero-shot chain-of-thought. All other configurations fail, including GPT-4 with just the explicit hints, or with CoT but without external summarization.

The critical factor is external history summarization. Without it, the model must track which arms have been tried and what returns were obtained purely from the raw interaction history in context. As the history grows long, this is effectively an in-context computation problem: the model must maintain and update a running average per arm from unstructured context. LLMs appear to fail consistently at this computation.

External summarization converts the unstructured history (a list of (arm, reward) tuples) into structured per-arm aggregates that are trivially readable. With this pre-processing, GPT-4 can then apply exploratory reasoning correctly.
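The conversion described above can be sketched in a few lines. This is a minimal illustration, not the evaluation's actual implementation: the function names `summarize_history` and `format_summary`, the arm labels, and the wording of the summary text are all hypothetical, and rewards are assumed to be numeric.

```python
from collections import defaultdict


def summarize_history(history):
    """Collapse raw (arm, reward) tuples into per-arm pull counts and mean rewards."""
    pulls = defaultdict(int)
    total = defaultdict(float)
    for arm, reward in history:
        pulls[arm] += 1
        total[arm] += reward
    return {
        arm: {"pulls": pulls[arm], "mean_reward": total[arm] / pulls[arm]}
        for arm in pulls
    }


def format_summary(summary, all_arms):
    """Render the aggregates as short text lines, one per arm (hypothetical wording)."""
    lines = []
    for arm in all_arms:
        stats = summary.get(arm)
        if stats is None:
            # Untried arms are listed explicitly so they are not silently forgotten.
            lines.append(f"{arm}: never tried")
        else:
            lines.append(
                f"{arm}: pulls={stats['pulls']}, mean reward={stats['mean_reward']:.2f}"
            )
    return "\n".join(lines)


# Illustrative data only: four rounds over three hypothetical arms.
history = [("blue", 1), ("red", 0), ("blue", 0), ("blue", 1)]
summary = summarize_history(history)
print(format_summary(summary, ["blue", "red", "green"]))
```

The point of the sketch is that the bookkeeping (counts and running means) happens outside the model, so the prompt carries a fixed-size table rather than a history that grows with every round.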

The negative interpretation matters: external summarization is itself a non-trivial algorithm design problem in complex environments. If the history has thousands of entries with complex structure (sequences of states, actions, and observations), pre-processing it into the right summary form is a hard problem in its own right. LLM exploration in truly complex environments is therefore likely to remain unreliable.

This connects to the note "Why do trajectories matter more than individual examples for in-context learning?": both findings reveal that LLMs' in-context learning (ICL) capabilities in sequential decision-making are fragile and depend on specific data presentation choices that are non-trivial to implement.

Inquiring lines that read this note 16

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Why can LLMs generate ideas better than they evaluate them?

Does decoupling planning from execution improve multi-step reasoning accuracy?

What role does exploration-exploitation balance play in abstraction formation?

What critical LLM failures do standard benchmarks hide?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Why do language models reinforce false assumptions instead of correcting them?

Why do large language models fail at taking conversational initiative?

Do language models develop causal world models or rely on statistical patterns?

How does example difficulty affect learning efficiency in language models?

Why does exploration quality matter more than learner network depth?

How does reasoning graph topology affect breakthrough insights and generalization?

What distinguishes systematic search from wandering exploration in reasoning?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Why do LLMs explain correct reasoning but then choose greedy actions?

Do language models learn genuine linguistic structure or just surface patterns?

What causes language models' strategic rationality to decline with increased game complexity?

Related concepts in this collection 6

Why do trajectories matter more than individual examples for in-context learning? Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.
both reveal specific data structure requirements for LLM sequential decision making ICL
Why do language models ignore information in their context? Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
exploration failure may involve context integration failure when unstructured history competes with parametric patterns
Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
exploration tasks have unbounded difficulty without external summarization; compute alone cannot compensate
Can transformers learn to solve new problems within episodes? Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.
ICRL demonstrates successful in-context adaptation where vanilla LLMs fail; the difference: ICRL's RL fine-tuning explicitly trains the exploration-exploitation trade-off, while this note shows LLMs cannot learn to explore from language patterns alone
Do language models learn differently from good versus bad outcomes? Do LLMs update their beliefs asymmetrically when learning from their own choices versus observing others? This matters for understanding whether agentic AI systems might inherit human cognitive biases.
provides a cognitive mechanism for exploration failure: optimism bias toward chosen actions creates a self-reinforcing exploitation loop that external summarization may bypass by providing objective history
Why do large language models explore less effectively than humans? This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.
provides the mechanistic complement: this note documents the behavioral failure (need for external summarization); the empowerment note identifies the architectural cause (uncertainty signals in early blocks preempt empowerment signals in middle blocks, producing premature exploitation over exploration)
