What data presentation structures enable LLMs to learn decision-making from examples?
This note explores how decision-making examples have to be *shaped* before an LLM can learn from them. The same raw experience can be packaged as a finetuning corpus, a running in-context history, a memory store, or a search tree, and these competing 'presentation' strategies often decide whether learning happens at all. The corpus suggests the format is doing as much work as the data.
The most direct evidence comes from curated example collections. When LLMs are finetuned on the trial-by-trial records of psychology experiments, they end up predicting human choices better than purpose-built cognitive theories, and the structure of the data lets them capture individual differences and transfer across tasks they were never tuned for (Can language models learn to model human decision making?). The lesson isn't 'more data'; it's that decision episodes presented as clean behavioral sequences are a learnable substrate in a way that abstract theory is not.
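To make "clean behavioral sequences" concrete, here is a minimal sketch of how one participant's trial-by-trial record could be serialized into a single text example for finetuning. The `Trial` schema, the field names, and the sentence template are all hypothetical illustrations, not the format used in the cited work.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    """One decision in an experiment (hypothetical schema)."""
    options: Dict[str, str]  # option label -> description shown to the participant
    choice: str              # label of the option the participant picked
    outcome: float           # payoff observed after the choice


def episode_to_text(participant_id: str, trials: List[Trial]) -> str:
    """Serialize one participant's trials, in order, as a plain-text sequence."""
    lines = [f"Participant {participant_id}"]
    for i, t in enumerate(trials, start=1):
        opts = "; ".join(f"{label}: {desc}" for label, desc in t.options.items())
        lines.append(
            f"Trial {i}. Options: {opts}. Chose {t.choice}. Outcome: {t.outcome:g}."
        )
    return "\n".join(lines)


if __name__ == "__main__":
    gamble = {"A": "sure 3", "B": "50% chance of 8"}
    record = [Trial(gamble, "B", 0.0), Trial(gamble, "A", 3.0)]
    print(episode_to_text("p01", record))
```

Keeping each participant's trials together and in order is the design choice that matters here: it is what would let a model pick up on individual tendencies rather than only population averages.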
But in-context examples behave very differently from finetuning examples, and here presentation structure becomes decisive. In simple multi-armed bandit tasks, models can't reliably learn to explore from a raw running history of what they tried and what happened. In the reported experiments, only GPT-4 explored satisfactorily, and only when that history was *externally summarized* and paired with explicit exploratory hints and chain-of-thought scaffolding (Why do LLMs struggle with exploration in simple decision tasks?). So the same examples that fail as an undigested transcript succeed once someone restructures them into an aggregated summary. That points to a structural ceiling on learning from unprocessed sequential experience.
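The contrast between an undigested transcript and an aggregated summary can be sketched in a few lines. This is an illustrative assumption about what "external summarization" might look like for a bandit task (per-arm pull counts and average rewards); the function names and the wording of the summary are hypothetical, not taken from the cited study.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

History = List[Tuple[str, float]]  # (arm pulled, reward received), in order


def raw_history(history: History) -> str:
    """The undigested transcript: one line per round, nothing aggregated."""
    return "\n".join(
        f"Round {i}: pulled {arm}, reward {reward:g}"
        for i, (arm, reward) in enumerate(history, start=1)
    )


def summarize_history(history: History, arms: Sequence[str]) -> str:
    """The aggregated view: per-arm counts and average reward, computed outside the model."""
    pulls = defaultdict(int)
    total = defaultdict(float)
    for arm, reward in history:
        pulls[arm] += 1
        total[arm] += reward

    lines = []
    for arm in arms:
        n = pulls[arm]
        if n == 0:
            # Untried arms are surfaced explicitly so they are not forgotten.
            lines.append(f"{arm}: never tried")
        else:
            lines.append(f"{arm}: tried {n} times, average reward {total[arm] / n:.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    log = [("A", 1.0), ("A", 0.0), ("B", 1.0)]
    print(raw_history(log))
    print(summarize_history(log, arms=["A", "B", "C"]))
```

The summary grows with the number of arms rather than the number of rounds, and it lists untried arms explicitly, which is the kind of bookkeeping the transcript form leaves for the model to do on its own.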
Two lines of work answer that ceiling by giving the examples an explicit architecture. Memory-based learning (AgentFly) stores experience in three typed modules (case, subtask, and tool memory), so that credit assignment and improvement happen entirely through memory operations with no weight updates, reaching 87.88% on the GAIA validation set (Can agents learn continuously from experience without updating weights?). Tree search (AlphaLLM) does something parallel: by laying decision paths out as a branching structure, the success or failure at the leaves becomes a dense, rankable signal, letting the model derive process-level reward without any human annotation (Can tree search replace human feedback in LLM training?). In both cases the *shape* (a tree, a partitioned memory) is what converts scattered outcomes into something the model can learn a policy from.
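How leaf outcomes can turn into a step-level signal is easiest to see in a toy tree. The sketch below scores each intermediate step by the fraction of successful leaves beneath it, a simple Monte Carlo-style average chosen for illustration. It is not the method of the cited work, which also relies on critic models; `Node`, `backup`, and `ranked_children` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                                  # description of the reasoning/decision step
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None            # leaves only: 1.0 success, 0.0 failure
    value: float = 0.0                         # filled in by backup()


def backup(node: Node) -> Tuple[float, int]:
    """Set node.value to the mean leaf outcome in its subtree.

    Returns (sum of leaf outcomes, number of leaves) so parents can aggregate.
    """
    if not node.children:
        total, count = (node.outcome or 0.0), 1
    else:
        total, count = 0.0, 0
        for child in node.children:
            child_total, child_count = backup(child)
            total += child_total
            count += child_count
    node.value = total / count
    return total, count


def ranked_children(node: Node) -> List[Node]:
    """Order sibling steps from most to least promising by backed-up value."""
    return sorted(node.children, key=lambda c: c.value, reverse=True)


if __name__ == "__main__":
    tree = Node("root", [
        Node("guess first", [Node("ok", outcome=1.0), Node("fail", outcome=0.0)]),
        Node("plan first", [Node("ok", outcome=1.0), Node("ok", outcome=1.0)]),
    ])
    backup(tree)
    for child in ranked_children(tree):
        print(child.step, child.value)
```

Only the leaves carry labels, yet every intermediate step ends up with a score and siblings become comparable. That ranking is the kind of dense, process-level signal the paragraph above describes.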
The deeper reason all this structuring is needed surfaces in the work on why LLM decision-making is fragile to begin with. Models often *explain* a decision principle correctly yet fail to *apply* it, a disconnect between knowing and doing (Can LLMs understand concepts they cannot apply?), and their reasoning wanders unsystematically, so success probability falls off sharply as problems get deeper (Why do reasoning LLMs fail at deeper problem solving?). One promising fix reframes the whole problem: parameterize the policy in language and refine it with environmental feedback, so that declarative knowledge and procedural skill are unified through the example loop itself. This reportedly also cuts data requirements substantially and keeps every step explainable (Can language modeling close the knowing-doing gap in AI?). The thread running through all of these is that you don't just feed an LLM decisions and hope; you choose a structure (sequence, summary, memory, tree, or language policy) that turns examples into a signal it can climb.
Sources (7 notes)
LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.
Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Think-In Games demonstrates that when LLMs generate language-guided policies refined by environmental feedback, they develop procedural competence while retaining explainability. The approach dramatically reduces data demands and makes agent reasoning transparent at every step.