SYNTHESIS NOTE

Can transformers learn to solve new problems within episodes?

Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.

Synthesis note · 2026-02-22 · sourced from LLM Architecture

"RL + Transformer = A General-Purpose Problem Solver" (2501.14176) demonstrates that a pre-trained transformer fine-tuned with RL over multiple episodes develops In-Context Reinforcement Learning (ICRL) — an emergent ability to solve problems never encountered during training by learning within the episode context.

Llama 3.1 8B, fine-tuned with DQN on parametric Frozen Lake games, shows several capabilities at once:

Solves unseen in-distribution environments with remarkable sample efficiency
Shows strong performance on out-of-distribution environments
Is robust to the quality of its training data
Stitches together behaviors from its context in a piecemeal fashion
Adapts to non-stationary environments
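To make "parametric" concrete, here is a minimal sketch of how a family of Frozen Lake maps might be sampled. The parameters (`size`, `hole_prob`), the function names, and the solvability check are illustrative assumptions; this note does not record how the paper parametrizes its games.

```python
import random
from collections import deque

def is_solvable(grid):
    """Breadth-first search from the start (top-left) to the goal, avoiding holes."""
    size = len(grid)
    seen, queue = {(0, 0)}, deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        if grid[r][c] == "G":
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < size and 0 <= nc < size
            if inside and (nr, nc) not in seen and grid[nr][nc] != "H":
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def sample_map(size, hole_prob, rng):
    """Sample a size x size grid of 'S' start, 'G' goal, 'H' hole, 'F' frozen tiles.

    Resamples until the goal is reachable, so hole_prob should stay well below 1.
    """
    while True:
        grid = [["H" if rng.random() < hole_prob else "F" for _ in range(size)]
                for _ in range(size)]
        grid[0][0], grid[size - 1][size - 1] = "S", "G"
        if is_solvable(grid):
            return grid

rng = random.Random(0)
train_maps = [sample_map(4, 0.2, rng) for _ in range(3)]
# A fresh sample from the same parameters stands in for an
# "unseen in-distribution" environment (hypothetical split).
held_out = sample_map(4, 0.2, rng)
```

Under this reading, changing `size` or `hole_prob` at evaluation time would be one way to build an out-of-distribution test, though that split is also an assumption.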

The mechanism is meta-learning via RL. The model adapts its policy based on the history of interactions within an episode — learning from its own within-episode experience without any weight updates. This parallels DeepMind's finding that transformer-based agents trained with meta-RL adapt to complex tasks within timescales comparable to human learning.
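A toy sketch of adaptation without weight updates: the policy below is a fixed function, and only the context it reads changes. The hand-written `frozen_policy` merely stands in for the fine-tuned transformer; the grid, the rule it applies, and the split of one episode into repeated attempts are assumptions made for illustration, not the paper's setup.

```python
import random

GRID = ["SFF",
        "FHF",
        "FFG"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action):
    """Deterministic Frozen Lake step: returns (next_pos, reward, done)."""
    r = min(max(pos[0] + MOVES[action][0], 0), len(GRID) - 1)
    c = min(max(pos[1] + MOVES[action][1], 0), len(GRID[0]) - 1)
    tile = GRID[r][c]
    return (r, c), (1.0 if tile == "G" else 0.0), tile in "GH"

def frozen_policy(context, pos, rng):
    """Fixed function of (context, state); it is never modified between attempts."""
    # (state, action) pairs that ended an earlier attempt in a hole
    bad = {(s, a) for (s, a, rew, done) in context if done and rew == 0.0}
    options = [a for a in MOVES if (pos, a) not in bad] or list(MOVES)
    return rng.choice(options)

def run_attempt(context, rng, max_steps=50):
    """Play one attempt from the start tile, appending every transition to context."""
    pos = (0, 0)
    for _ in range(max_steps):
        action = frozen_policy(context, pos, rng)
        nxt, reward, done = step(pos, action)
        context.append((pos, action, reward, done))  # the only state that changes
        pos = nxt
        if done:
            return reward
    return 0.0

rng = random.Random(0)
context = []  # persists across attempts, playing the role of the context window
returns = [run_attempt(context, rng) for _ in range(20)]
falls = sum(1 for (_, _, rew, done) in context if done and rew == 0.0)
```

Because `frozen_policy` is the same function on every attempt, any change in behavior across attempts comes only from the growing `context`. In this grid each fatal move can occur at most once, so `falls` can never exceed 4.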

The critical distinction from standard fine-tuning: the RL training does not teach the model to solve specific problems. It teaches the model to learn to solve problems from experience. The training objective (RL over multiple episodes with varying configurations) creates a meta-learning pressure that the transformer architecture can exploit through its context window. As the note "Why do trajectories matter more than individual examples for in-context learning?" argues, full trajectories succeed where isolated state-action pairs fail; ICRL's multi-episode training naturally provides the trajectory burstiness property that enables sequential decision-making ICL to emerge.
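For reference, the standard DQN loss can be written with the Q-function conditioned on the in-context history rather than on a single state. The history conditioning and the symbols below are an assumed mapping of this setup onto DQN, not a formula taken from the paper.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{(h_t,\, a_t,\, r_t,\, h_{t+1})}
  \Big[ \big( r_t + \gamma \max_{a'} Q_{\theta^-}(h_{t+1}, a') - Q_\theta(h_t, a_t) \big)^2 \Big],
\qquad h_{t+1} = (h_t,\, a_t,\, r_t,\, s_{t+1})
```

Here $h_t$ is the interaction history held in the context window, $\gamma$ is the discount factor, and $\theta^-$ denotes the target-network parameters. Because $Q_\theta$ sees $h_t$, minimizing this loss across many environment configurations rewards using the history to infer the current environment, which is one way to read the meta-learning pressure described above.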

ICRL extends the principle from the note "Does RL teach reasoning or just when to use it?": RL teaches not only when to reason but also when and how to learn within context. The base model already has the capacity for in-context adaptation; RL post-training activates and refines this meta-learning capacity.

In line with the note "Do base models already contain hidden reasoning ability?", ICRL suggests that meta-learning capability may be another latent capacity that RL activates rather than creates. The pre-trained model's in-context learning ability is the substrate; RL post-training shapes it into in-context reinforcement learning.

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Each line of inquiry below is followed by the specific question that draws on this note.

How does example difficulty affect learning efficiency in language models?

How do transformers generate harder solutions when mostly trained on easier problems?

Do base models contain latent reasoning that training can unlock?

Does RL training activate latent meta-learning capacity or create it from scratch?

What determines success in training models on multiple tasks?

How do transformers stitch together learned behaviors when adapting to new tasks?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Can recurrent transformers learn genuinely new computations beyond inference stages?

Related concepts in this collection 6

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
ICRL extends: RL activates meta-learning, not just reasoning
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
meta-learning as another latent capability
Can agents learn from failure without updating their weights? Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
ICRL is the RL-trained version of episodic learning
Why do trajectories matter more than individual examples for in-context learning? Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.
trajectory burstiness specifies the data property that enables ICRL: same-level trajectories in training data create the meta-learning pressure that ICRL exploits; ICRL's generalization to unseen environments depends on having encountered bursty trajectory distributions during RL fine-tuning
Why do LLMs struggle with exploration in simple decision tasks? This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.
ICRL demonstrates successful in-context adaptation via RL, while this note shows exploration failure in LLM agents; the difference may be that ICRL's RL fine-tuning specifically trains the exploration-exploitation trade-off, while vanilla LLMs must approximate it from language patterns alone
Can LLMs handle multiple tasks at once during inference? Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?
task superposition provides the representational substrate for ICRL: the model can maintain multiple task interpretations from in-context experience simultaneously, enabling meta-learning across environment variations within a single episode
