SYNTHESIS NOTE


Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This note explores why isolated state-action pairs fail where full trajectories succeed.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

In-context learning (ICL) for supervised tasks works by placing a few input-output examples in the context. Naively carrying this over to sequential decision making, by providing a few state-action pairs, does not enable ICL of new tasks. The key finding: the context must contain full or partial trajectories from the same environment level as the query, not just isolated examples. The corresponding data property is called trajectory burstiness.

Why the difference matters: In supervised learning, examples can come from different instances, because the model only has to learn the input-output mapping. In sequential decision making, the model must generalize from demonstrations in the same level or environment to the wide range of states it may encounter at deployment. A sparse set of state-action pairs doesn't cover the state space; full trajectories do.
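To make the contrast concrete, here is a minimal sketch of assembling a context from same-level demonstration trajectories followed by the query state. The `Trajectory` class, the `build_context` helper, the level names, and the `state=... action=...` text serialization are all hypothetical; the note does not say how trajectories are encoded.

```python
from dataclasses import dataclass
from typing import List, Tuple

Step = Tuple[str, str]  # (state, action), both as plain strings for illustration


@dataclass
class Trajectory:
    level_id: str      # which environment level the trajectory was collected in
    steps: List[Step]  # ordered state-action pairs


def build_context(demos: List[Trajectory], query_level: str, query_state: str) -> str:
    """Serialize every demonstration from the query's level, then the query state."""
    same_level = [t for t in demos if t.level_id == query_level]
    if not same_level:
        raise ValueError(f"no demonstrations for level {query_level!r}")
    lines = []
    for i, traj in enumerate(same_level):
        lines.append(f"# demonstration {i}")
        # Keep the whole trajectory in order rather than sampling isolated pairs.
        lines.extend(f"state={s} action={a}" for s, a in traj.steps)
    lines.append("# query")
    lines.append(f"state={query_state} action=")
    return "\n".join(lines)


# Toy data (assumed): two demonstrations from maze_3, one from an unrelated level.
demos = [
    Trajectory("maze_3", [("start", "right"), ("corridor", "up"), ("door", "open")]),
    Trajectory("maze_7", [("pit", "jump")]),
    Trajectory("maze_3", [("start", "up"), ("ledge", "right")]),
]
context = build_context(demos, query_level="maze_3", query_state="corridor")
print(context)
```

The filter on `level_id` is the point of the sketch: only trajectories from the query's own level are placed in the context, and each one is kept whole and in order.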

Trajectory burstiness is the probability that a given input sequence contains at least two trajectories from the same level. When this property is present in pre-training data, the model acquires the capacity to learn new tasks from demonstrations at inference time without weight updates.
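The definition above can be estimated empirically as the fraction of input sequences that contain at least two trajectories from the same level. The sketch below assumes each sequence is represented only by the level ids of the trajectories packed into it; the function names and level ids are illustrative, not taken from the source.

```python
from collections import Counter
from typing import Hashable, Sequence


def is_bursty(level_ids: Sequence[Hashable]) -> bool:
    """True if at least two trajectories in the sequence share a level."""
    counts = Counter(level_ids)
    return any(c >= 2 for c in counts.values())


def trajectory_burstiness(sequences: Sequence[Sequence[Hashable]]) -> float:
    """Empirical estimate: the fraction of sequences that are bursty."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return sum(is_bursty(s) for s in sequences) / len(sequences)


# Each inner list holds the level id of every trajectory in one input sequence.
sequences = [
    ["maze_3", "maze_3", "maze_7"],  # bursty: maze_3 appears twice
    ["maze_1", "maze_2", "maze_4"],  # not bursty: all levels distinct
    ["jump_9", "jump_9", "jump_9"],  # bursty
    ["maze_5"],                      # not bursty: a single trajectory
]
print(trajectory_burstiness(sequences))  # 0.5
```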

Additional factors that increase ICL performance:

Larger model and dataset size
More task diversity in pre-training
Environment stochasticity (forces generalization over trajectory variation)
Higher trajectory burstiness in pre-training data

Generalization scope demonstrated: Train/test tasks differ greatly — different states, actions, dynamics, and reward functions. The model generalizes from, e.g., platform games to maze navigation from a handful of expert demonstrations. This is substantially harder than prior work that generalizes across reward function variants of the same environment.

The implication for dataset construction: sequential decision-making ICL requires a property of the data distribution, trajectory burstiness, that standard language modeling data does not naturally contain. This is a requirement on the structure of the data, not just on its scale.
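One way to impose this property when packing pre-training sequences is sketched below. It is an assumed construction, not the procedure from the source: with probability `p_bursty` every trajectory in a sequence is drawn from a single level, and otherwise each trajectory comes from a distinct level. The pool `trajs_by_level` and its contents are hypothetical placeholders.

```python
import random
from typing import Dict, List, Tuple


def sample_sequence(
    trajs_by_level: Dict[str, List[str]],
    n_traj: int,
    p_bursty: float,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Return n_traj (level_id, trajectory) pairs with controlled burstiness."""
    if n_traj < 2:
        raise ValueError("a bursty sequence needs at least two trajectories")
    levels = sorted(trajs_by_level)
    if len(levels) < n_traj:
        raise ValueError("need at least n_traj levels for the non-bursty case")
    if rng.random() < p_bursty:
        # Bursty case: every trajectory shares one level.
        chosen = [rng.choice(levels)] * n_traj
    else:
        # Non-bursty case: all levels distinct (sampled without replacement).
        chosen = rng.sample(levels, n_traj)
    # Trajectories within a level are drawn with replacement for simplicity.
    return [(lv, rng.choice(trajs_by_level[lv])) for lv in chosen]


# Toy pool (assumed): 5 levels with 3 placeholder trajectories each.
pool = {f"level_{i}": [f"traj_{i}_{j}" for j in range(3)] for i in range(5)}
rng = random.Random(0)
batch = [sample_sequence(pool, n_traj=3, p_bursty=0.7, rng=rng) for _ in range(2000)]
bursty_fraction = sum(len({lv for lv, _ in seq}) < 3 for seq in batch) / len(batch)
print(round(bursty_fraction, 2))  # should land near the 0.7 target
```

Putting all trajectories on one level in the bursty case is a simplification; the definition only requires two from the same level, so a mixed sequence would also count as bursty.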

This connects to the note "Does training data format shape reasoning strategy more than domain?". Here the structural property sits at the trajectory level rather than the reasoning-step level, but the principle is the same: data structure determines capability.

Inquiring lines that read this note 46

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do language models understand semantics or rely on pattern matching?

Why does removing language from its context destroy what makes it work?

How do training priors constrain what context information can override?

What determines success in training models on multiple tasks?

How do self-generated feedback mechanisms enable effective model learning?

How do training objectives shape what a world model actually learns?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

When does natural context diversity reduce the need for explicit exploration?

Can alternative training methods improve on supervised fine-tuning for language models?

Do language models learn genuine linguistic structure or just surface patterns?

Why do context-sensitive languages transfer better than regular or context-free languages?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How do multi-agent systems achieve genuine cooperation and reasoning?

What role does sequence model in-context learning play in multi-agent cooperation?

Do base models contain latent reasoning that training can unlock?

Do emergent abilities result from genuine new capabilities or implicit in-context learning?

Can prompting inject entirely new knowledge into language models?

What memory architectures best support persistent reasoning across extended interactions?

Should GUI agents use structured representations instead of raw pixels?

What temporal signals in screen recordings matter most for task understanding?

What properties determine whether reward signals teach genuine reasoning?

How does credit assignment work across many sequential decision steps in language models?

What limits mechanistic interpretability's ability to characterize models?

What role does a model's representational structure play in learning?

How should dialogue recommender systems manage conversation history and state?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How do chunk-based step segmentation and trajectory structure modeling differ?

How should retrieval systems optimize for multi-step reasoning during inference?

What computational cost does trajectory-bursty inference impose on per-query context requirements?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Does environment stochasticity force models to generalize better across trajectory variations?

How does sequence length affect sparsity tolerance in models?

Can activation sparsity patterns guide the selection of in-context learning demonstrations?

Can self-supervised signals enable process supervision without human annotation?

Can trajectory structure alone provide process supervision without human annotation?

How should agents balance memory condensation to optimize context efficiency?

Can agents compress long trajectories without losing critical decision context?

Is model self-awareness based on genuine introspection or pattern matching?

Does input surprise drive the implicit recognition of on-policy context?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why does the order of training examples matter for what models learn?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

What data properties enable transformers to learn sequential decision-making in context?

How can AI agents autonomously learn and transfer skills across tasks?

How do training data properties shape reasoning capability development?

What makes a good in-context learning example for a given task?

Do language models develop causal world models or rely on statistical patterns?

Does next-state prediction alone build mechanistic world models or just sophisticated interpolation?

Related concepts in this collection 5

Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
Trajectory burstiness is another case where data structure determines emergent capability.

What do models actually learn from chain-of-thought training?
When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
Structural properties of training data drive learning; this applies at both the reasoning-trace and trajectory levels.

Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
The context-length requirements for trajectory-bursty inference raise per-query compute costs.

Can LLMs handle multiple tasks at once during inference?
Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?
Task superposition may be the representational mechanism enabling trajectory-bursty ICL: the model maintains multiple task interpretations from in-context trajectories simultaneously before committing to a single policy at generation time.

Can transformers learn to solve new problems within episodes?
Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.
ICRL (in-context reinforcement learning) is the RL-trained capability that trajectory burstiness enables: same-level trajectories create the meta-learning pressure during training that ICRL exploits at inference time for adaptation to unseen environments.
