SYNTHESIS NOTE

Why do LLMs struggle with exploration in simple decision tasks?

This note explores why large language models fail at exploration, a core decision-making capability, even when they excel at other tasks, and what specific conditions might help them succeed.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

Decision-making agents require three core capabilities: generalization (supervised learning), exploration (making suboptimal short-term decisions to gather information), and planning (accounting for long-term consequences). LLMs have been shown to possess generalization and limited planning. Exploration turns out to be the hardest.

In systematic evaluation across multi-armed bandit environments — one of the simplest exploration problems — only a single LLM/prompt configuration achieves satisfactory exploratory behavior: GPT-4 + explicit exploratory hints + external per-arm history summarization + zero-shot chain-of-thought. All other configurations fail, including GPT-4 with just the explicit hints, or with CoT but without external summarization.

The critical factor is external history summarization. Without it, the model must track which arms have been tried and what rewards were obtained purely from the raw interaction history in context. As the history grows, this effectively becomes an in-context computation problem: the model must maintain and update a running average per arm from unstructured context. LLMs appear to fail at this computation consistently.
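
To make the required computation concrete, the per-arm running average can be written as a standard incremental update. The symbols here are assumptions introduced for illustration and do not come from the note: n_a is the number of times arm a has been pulled, \hat{\mu}_a is its current average reward, and r_t is the reward observed when arm a is pulled at step t.

```latex
n_a \leftarrow n_a + 1, \qquad
\hat{\mu}_a \leftarrow \hat{\mu}_a + \frac{r_t - \hat{\mu}_a}{n_a}
```

Without external summarization, the model would have to carry out the equivalent of this bookkeeping for every arm implicitly, over the raw history in its context window.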

External summarization converts the unstructured history (a list of (arm, reward) tuples) into structured per-arm aggregates that are trivially readable. With this pre-processing, GPT-4 can then apply exploratory reasoning correctly.
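
A minimal sketch of what such a summarization step might look like, assuming the history is a plain list of (arm, reward) tuples. The function names, the summary wording, and the arm labels are hypothetical and are not taken from the evaluation described above; the point is only to show how raw history collapses into per-arm counts and averages that can be placed in a prompt.

```python
from collections import defaultdict


def summarize_history(history):
    """Collapse a raw list of (arm, reward) tuples into per-arm aggregates."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward
    return {
        arm: {"pulls": counts[arm], "mean_reward": totals[arm] / counts[arm]}
        for arm in sorted(counts)
    }


def format_summary(summary, all_arms):
    """Render the aggregates as text, listing untried arms explicitly."""
    lines = []
    for arm in all_arms:
        stats = summary.get(arm)
        if stats is None:
            lines.append(f"Arm {arm}: never tried")
        else:
            lines.append(
                f"Arm {arm}: tried {stats['pulls']} times, "
                f"average reward {stats['mean_reward']:.2f}"
            )
    return "\n".join(lines)


# Hypothetical interaction history: arm A pulled three times, arm B once.
history = [("A", 1), ("B", 0), ("A", 0), ("A", 1)]
summary = summarize_history(history)
prompt_block = format_summary(summary, all_arms=["A", "B", "C"])
print(prompt_block)
```

Listing untried arms explicitly is a design choice in this sketch: it puts the information most relevant to exploration directly in the text, so it does not have to be inferred from what is absent in the history.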

The negative interpretation matters: external summarization is itself a non-trivial algorithm design problem in complex environments. If the history has thousands of entries with complex structure (sequences of states, actions, and observations), pre-processing it into the right summary form is a hard problem in its own right. LLM exploration capability in truly complex environments is therefore likely to remain unreliable.

This connects to the note "Why do trajectories matter more than individual examples for in-context learning?". Both findings reveal that LLMs' in-context learning (ICL) capabilities in sequential decision-making are fragile and depend on specific data presentation choices that are non-trivial to implement.

Original note title

llms fail at in-context exploration without external summarization and explicit exploratory prompting even with strong base capabilities