INQUIRING LINE

What data presentation structures enable LLMs to learn decision-making from examples?

This explores how the *structure and format* of example data — not just having examples — determines whether an LLM can actually learn to make decisions, drawing together finetuning corpora, in-context history, memory schemas, and search trees as competing 'presentation' strategies.


Decision-making examples have to be *shaped* before an LLM can learn from them. The same raw experience can be packaged as a finetuning dataset, a running context summary, a memory store, or a search tree, and the packaging often decides whether learning happens at all. The corpus suggests the format does as much work as the data.

The most direct evidence comes from curated example collections. When LLMs are finetuned on the trial-by-trial records of psychology experiments, they end up predicting human choices better than purpose-built cognitive theories, and the structure of the data lets them capture individual differences and transfer across tasks they were never tuned for (see "Can language models learn to model human decision making?"). The lesson isn't 'more data'; it's that decision episodes presented as clean behavioral sequences are a learnable substrate in a way that abstract theory is not.
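
As a rough illustration of what "clean behavioral sequences" could look like, the sketch below serializes one participant's trials into a single text sequence suitable for finetuning. The `Trial` fields, the `<<choice>>` marker, and the sentence template are all assumptions made for this example; the source does not specify the exact format used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    """One decision episode (hypothetical schema, not taken from the source)."""
    options: Dict[str, str]  # option label -> description shown to the participant
    choice: str              # label the participant picked
    outcome: float           # reward the participant received


def serialize_session(participant_id: str, trials: List[Trial]) -> str:
    """Render a session as one ordered text sequence, one line per trial.

    Keeping trials in order and in a fixed template is the point: the model
    sees the same structure on every line and only the behavior varies.
    """
    lines = [f"Participant {participant_id}"]
    for i, t in enumerate(trials, start=1):
        opts = "; ".join(f"{label}: {desc}" for label, desc in t.options.items())
        lines.append(
            f"Trial {i}. Options: {opts}. Chose <<{t.choice}>>. Outcome: {t.outcome:g}."
        )
    return "\n".join(lines)
```

Calling `serialize_session("p01", trials)` yields one block of text per participant. Keeping a participant's trials together in one sequence is one plausible way to let a model pick up individual differences, though that is an assumption of this sketch rather than a documented design.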

But in-context examples behave very differently from finetuning examples, and here presentation structure becomes decisive. In simple bandit tasks, models can't reliably learn to explore from a raw running history of what they tried and what happened. They succeed only when that history is *externally summarized* and paired with explicit exploratory hints and chain-of-thought scaffolding, and in the cited study only GPT-4 with all three supports explored satisfactorily (see "Why do LLMs struggle with exploration in simple decision tasks?"). So the same examples that fail as an undigested transcript succeed once someone restructures them into an aggregated summary. That points to a structural ceiling on learning from unprocessed sequential experience.
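
To make the contrast concrete, here is a minimal sketch of the two presentations of the same bandit history: the undigested transcript and an externally computed per-arm summary. The function names and output wording are hypothetical, and the summary statistics (pull counts and mean reward) are an assumption about what "aggregated" might reasonably mean.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

History = List[Tuple[str, float]]  # (arm label, reward) in the order played


def raw_history(history: History) -> str:
    """The undigested transcript: one line per round, in order."""
    return "\n".join(
        f"Round {i}: pulled arm {arm}, reward {reward:g}"
        for i, (arm, reward) in enumerate(history, start=1)
    )


def summarized_history(history: History, arms: Iterable[str]) -> str:
    """The external summary: per-arm counts and means, computed outside the model."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for arm, reward in history:
        counts[arm] += 1
        totals[arm] += reward

    lines = []
    for arm in arms:
        if counts[arm] == 0:
            # Surfacing untried arms explicitly is what makes exploration visible.
            lines.append(f"Arm {arm}: never tried")
        else:
            mean = totals[arm] / counts[arm]
            lines.append(f"Arm {arm}: pulls={counts[arm]}, mean reward={mean:.2f}")
    return "\n".join(lines)
```

The raw form grows by one line per round, while the summary stays at one line per arm and states outright which arms are unexplored. The aggregation the model would otherwise have to do implicitly is done for it.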

Two lines of work answer that ceiling by giving the examples an explicit architecture. Memory-based learning stores experience in typed modules (case, subtask, and tool memory) so that credit assignment and improvement happen entirely through memory operations, with no weight updates, reaching 87.88% on GAIA validation (see "Can agents learn continuously from experience without updating weights?"). Tree search does something parallel: by laying decision paths out as a branching structure, it turns success or failure at the leaves into a dense, rankable signal, letting the model derive process-level reward without any human annotation (see "Can tree search replace human feedback in LLM training?"). In both cases the *shape* (a tree, a partitioned memory) is what converts scattered outcomes into something the model can learn a policy from.
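
The tree-search idea can be sketched in a few lines: back leaf outcomes up the tree, then compare sibling steps by their backed-up value to get ranked pairs. This is a deliberately simplified stand-in (plain averaging over children, no critic models, no rollout policy), not the method of the cited work. `Node`, `value`, and `preference_pairs` are names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A step in a decision path (hypothetical structure for illustration)."""
    step: str
    children: List["Node"] = field(default_factory=list)
    success: Optional[bool] = None  # only meaningful on leaves


def value(node: Node) -> float:
    """Back up leaf outcomes: a leaf is 1.0 or 0.0, an inner node is the mean
    of its children. A leaf with no recorded outcome counts as a failure."""
    if not node.children:
        return 1.0 if node.success else 0.0
    return sum(value(child) for child in node.children) / len(node.children)


def preference_pairs(node: Node, margin: float = 0.0) -> List[Tuple[str, str]]:
    """Return (better_step, worse_step) pairs for siblings whose backed-up
    values differ by more than `margin`, at every level of the tree."""
    pairs: List[Tuple[str, str]] = []
    ranked = sorted(node.children, key=value, reverse=True)
    for i, better in enumerate(ranked):
        for worse in ranked[i + 1:]:
            if value(better) - value(worse) > margin:
                pairs.append((better.step, worse.step))
    for child in node.children:
        pairs.extend(preference_pairs(child, margin))
    return pairs
```

Only the leaves carry an observed outcome, yet every intermediate step ends up with a score and a ranking against its siblings. That is the sense in which the tree shape makes a sparse outcome signal dense.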

The deeper reason all this structuring is needed surfaces in the work on why LLM decision-making is fragile to begin with. Models tend to *explain* a decision principle correctly yet fail to *apply* it, a disconnect between knowing and doing (see "Can LLMs understand concepts they cannot apply?"), and their reasoning wanders unsystematically, so success collapses on deeper problems (see "Why do reasoning LLMs fail at deeper problem solving?"). One promising fix reframes the whole problem: parameterize the policy in language and refine it with environmental feedback, so declarative knowledge and procedural skill get unified through the example loop itself. That also sharply reduces how much data is needed and keeps every step explainable (see "Can language modeling close the knowing-doing gap in AI?"). The thread running through all of these is that you don't just feed an LLM decisions and hope. You choose a structure (sequence, summary, memory, tree, or language policy) that turns examples into a signal it can climb.


Sources (7 notes)

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Why do LLMs struggle with exploration in simple decision tasks?

Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Why do reasoning LLMs fail at deeper problem solving?

Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.

Can language modeling close the knowing-doing gap in AI?

Think-In Games demonstrates that when LLMs generate language-guided policies refined by environmental feedback, they develop procedural competence while retaining explainability. The approach dramatically reduces data demands and makes agent reasoning transparent at every step.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. This question remains open: **What data presentation structures enable LLMs to learn decision-making from examples?** Treat the findings below as dated claims (spanning 2024–2026), not current truth, and re-test them against the latest models and methods.

What a curated library found — and when (dated claims, not current truth):
• Finetuning on trial-by-trial psychological experiment records lets LLMs predict human choices and transfer across unseen tasks, outperforming cognitive theories (2024).
• In-context learning from raw decision histories fails; LLMs only learn exploration when histories are externally summarized and paired with explicit exploratory hints and chain-of-thought scaffolding (2024).
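• Memory-based learning (typed case/subtask/tool modules) improves the policy through memory operations alone, without weight updates, achieving 87.88% on GAIA validation (2025).
• Monte Carlo tree search converts leaf-level success and failure into dense, rankable reward signals, replacing human annotation in training (2024).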
• LLMs exhibit "Potemkin understanding": they explain decision principles correctly but fail to apply them; reasoning wanders unsystematically on deeper tasks (2025–2026).
• Language-parameterized policies refined with environmental feedback unify declarative and procedural knowledge, reducing data requirements (2025).

Anchor papers (verify; mind their dates):
• arXiv:2403.15371 (2024-03): Can large language models explore in-context?
• arXiv:2404.12253 (2024-04): Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing
• arXiv:2505.20296 (2025-05): Reasoning LLMs are Wandering Solution Explorers
• arXiv:2508.21365 (2025-08): Think in Games: Learning to Reason in Games via RL with LLM

Your task:
(1) RE-TEST EACH CONSTRAINT. For every structural claim above, judge whether newer models (GPT-4o, Claude 4, o1-class reasoners), in-context window scaling, retrieval-augmented memory, or multi-agent orchestration have since RELAXED or OVERTURNED it. Separate the durable question (still open) from perishable limitations (possibly resolved); cite what resolved each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months, especially any that show unstructured histories *can* yield learning, or that challenge the explanation–application gap.
(3) Propose 2 research questions that ASSUME the regime may have shifted—e.g., can longer contexts or tool-use orchestration dissolve the need for external summarization?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
