Why do LLMs struggle with exploration in simple decision tasks?
This note explores why large language models fail at exploration, a core decision-making capability, even when they excel at other tasks, and which specific conditions help them succeed.
Decision-making agents require three core capabilities: generalization (supervised learning), exploration (making suboptimal short-term decisions to gather information), and planning (accounting for long-term consequences). LLMs have been shown to possess generalization and limited planning. Exploration turns out to be the hardest of the three.
In a systematic evaluation across multi-armed bandit environments, among the simplest exploration problems, only a single LLM/prompt configuration achieves satisfactory exploratory behavior: GPT-4 with explicit exploratory hints, external per-arm history summarization, and zero-shot chain-of-thought (CoT). All other configurations fail, including GPT-4 with only the explicit hints, or with CoT but without external summarization.
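For readers unfamiliar with the setting, here is a minimal sketch of a bandit environment and the raw interaction log an agent would receive without summarization. The class name, the reward means, the prompt wording, and the round-robin placeholder policy are all illustrative assumptions, not the setup used in the evaluation.

```python
import random


class BernoulliBandit:
    """K-armed bandit: arm i pays reward 1 with probability means[i], else 0."""

    def __init__(self, means, seed=0):
        self.means = list(means)
        self.rng = random.Random(seed)

    def pull(self, arm):
        return 1 if self.rng.random() < self.means[arm] else 0


def raw_history_prompt(history):
    """Render the log as unstructured text, one line per round (hypothetical wording)."""
    lines = [
        f"Round {t}: pulled arm {arm}, reward {reward}"
        for t, (arm, reward) in enumerate(history, start=1)
    ]
    return "\n".join(lines)


# Hypothetical means; the agent never sees these directly.
bandit = BernoulliBandit(means=[0.4, 0.6, 0.5])
history = []
for t in range(6):
    arm = t % 3  # placeholder policy (round-robin), standing in for the LLM
    history.append((arm, bandit.pull(arm)))

print(raw_history_prompt(history))
```

The printed log grows by one line per round, which is the unstructured context the agent has to reason over.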
The critical factor is external history summarization. Without it, the model must track which arms have been tried and what returns were obtained purely from the raw interaction history in context. When the history grows long, this effectively becomes an in-context computation problem: the model must maintain and update a running average per arm from unstructured context. LLMs appear to fail reliably at this computation.
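To make the bookkeeping concrete, the running average can be written as the standard incremental mean update. The symbols below are introduced here for illustration only: $n_a$ is the number of pulls of arm $a$ so far, $\hat{\mu}_a$ is its current average reward, and $r$ is the newly observed reward.

```latex
% Update applied after pulling arm a and observing reward r
% (initially n_a = 0 and \hat{\mu}_a = 0 for every arm):
n_a \leftarrow n_a + 1, \qquad
\hat{\mu}_a \leftarrow \hat{\mu}_a + \frac{r - \hat{\mu}_a}{n_a}

% Worked example: one arm observes rewards 1, 0, 1 in that order.
% Pull 1: \hat{\mu}_a = 0 + (1 - 0)/1 = 1
% Pull 2: \hat{\mu}_a = 1 + (0 - 1)/2 = 0.5
% Pull 3: \hat{\mu}_a = 0.5 + (1 - 0.5)/3 \approx 0.667
```

Each step is trivial in isolation. The difficulty described above is that the model has to carry out this update implicitly, for every arm, over a long unstructured log.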
External summarization converts unstructured history (a list of (arm, reward) tuples) into structured per-arm aggregates that are trivially readable. With this pre-processing, GPT-4 can then apply exploratory reasoning correctly.
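The conversion described above can be sketched in a few lines. This is a minimal illustration, assuming rewards are numeric and arms are indexed from 0. The function name and the output wording are hypothetical and are not the summary format used in the evaluation.

```python
def summarize_history(history, num_arms):
    """Collapse a list of (arm, reward) tuples into one summary line per arm."""
    counts = [0] * num_arms
    totals = [0.0] * num_arms
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward

    lines = []
    for arm in range(num_arms):
        if counts[arm] == 0:
            # Surface unexplored arms explicitly so they are easy to spot.
            lines.append(f"Arm {arm}: never pulled")
        else:
            mean = totals[arm] / counts[arm]
            lines.append(f"Arm {arm}: pulls={counts[arm]}, average reward={mean:.2f}")
    return "\n".join(lines)


history = [(0, 1), (1, 0), (0, 0), (2, 1), (0, 1)]
print(summarize_history(history, num_arms=4))
# Arm 0: pulls=3, average reward=0.67
# Arm 1: pulls=1, average reward=0.00
# Arm 2: pulls=1, average reward=1.00
# Arm 3: never pulled
```

The summary has a fixed size of one line per arm regardless of how long the history grows, so the model no longer has to aggregate anything itself.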
The negative interpretation matters: external summarization is a non-trivial algorithm design problem in complex environments. If the history has thousands of entries with complex structure (sequences of states, actions, and observations), pre-processing it into the right summary form is itself a hard problem. LLM exploration capability in truly complex environments is therefore likely to remain unreliable.
This connects to "Why do trajectories matter more than individual examples for in-context learning?": both findings reveal that LLMs' ICL capabilities in sequential decision-making contexts are fragile and depend on specific data presentation choices that are non-trivial to implement.
Inquiring lines that use this note as a source 16
This note is a source for these synthesized inquiries.
- Where do LLMs succeed at generation but struggle with evaluation?
- What role does exploration-exploitation balance play in abstraction formation?
- Why do language models fail at planning despite understanding strategies?
- When does natural context diversity reduce the need for explicit exploration?
- Why do large language models fail at taking conversational initiative?
- Why do LLMs plateau on creativity tasks while humans reach further?
- Can external summarization solve exploration problems in complex real-world environments?
- Do LLMs fail exploration because of context integration or computational limitations?
- What data presentation structures enable LLMs to learn decision-making from examples?
- Why does exploration quality matter more than learner network depth?
- What distinguishes systematic search from wandering exploration in reasoning?
- Does context diversity ever make active exploration unnecessary in bandits?
- Why do LLMs generate novel ideas but struggle to evaluate them?
- What mechanism causes LLMs to plateau on numerical optimization tasks?
- Why do LLMs explain correct reasoning but then choose greedy actions?
- What causes language models' strategic rationality to decline with increased game complexity?
Related concepts in this collection 6
- Why do trajectories matter more than individual examples for in-context learning?
  Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.
  both reveal specific data structure requirements for LLM sequential decision making ICL
- Why do language models ignore information in their context?
  Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
  exploration failure may involve context integration failure when unstructured history competes with parametric patterns
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  exploration tasks have unbounded difficulty without external summarization; compute alone cannot compensate
- Can transformers learn to solve new problems within episodes?
  Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.
  ICRL demonstrates successful in-context adaptation where vanilla LLMs fail; the difference: ICRL's RL fine-tuning explicitly trains the exploration-exploitation trade-off, while this note shows LLMs cannot learn to explore from language patterns alone
- Do language models learn differently from good versus bad outcomes?
  Do LLMs update their beliefs asymmetrically when learning from their own choices versus observing others? This matters for understanding whether agentic AI systems might inherit human cognitive biases.
  provides a cognitive mechanism for exploration failure: optimism bias toward chosen actions creates a self-reinforcing exploitation loop that external summarization may bypass by providing objective history
- Why do large language models explore less effectively than humans?
  This research investigates why LLMs make decisions too quickly during open-ended exploration tasks. It examines whether the problem lies in training data, prompt engineering, or something deeper in how transformer architectures process information over time.
  provides the mechanistic complement: this note documents the behavioral failure (need for external summarization); the empowerment note identifies the architectural cause (uncertainty signals in early blocks preempt empowerment signals in middle blocks, producing premature exploitation over exploration)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can large language models explore in-context?
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- Large Language Models Think Too Fast To Explore Effectively
- Teaching Large Language Models to Reason with Reinforcement Learning
- LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
- Outcome-based Exploration for LLM Reasoning
- Cognitive Architectures for Language Agents
- Reasoning LLMs are Wandering Solution Explorers
Original note title
llms fail at in-context exploration without external summarization and explicit exploratory prompting even with strong base capabilities