SYNTHESIS NOTE

Can transformers learn to solve new problems within episodes?

Explores whether transformer models can develop meta-learning abilities through RL training, enabling them to adapt to unseen environments by learning from within-episode experience alone, without updating weights.

Synthesis note · 2026-02-22 · sourced from LLM Architecture

"RL + Transformer = A General-Purpose Problem Solver" (2501.14176) demonstrates that a pre-trained transformer fine-tuned with RL over multiple episodes develops In-Context Reinforcement Learning (ICRL) — an emergent ability to solve problems never encountered during training by learning within the episode context.

Llama 3.1 8B, fine-tuned using DQN on parametric Frozen Lake games, achieves several capabilities simultaneously.

The mechanism is meta-learning via RL. The model adapts its policy based on the history of interactions within an episode — learning from its own within-episode experience without any weight updates. This parallels DeepMind's finding that transformer-based agents trained with meta-RL adapt to complex tasks within timescales comparable to human learning.
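A minimal sketch of the interaction loop this describes, assuming a one-state toy task rather than Frozen Lake. The names `toy_q_fn`, `InContextAgent`, and `run_toy_episode` are hypothetical, and `toy_q_fn` is a hand-written stand-in for the fine-tuned model, not the paper's implementation. The point it illustrates is only that the policy function stays fixed while the context grows.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (state, action, reward, next_state); a hypothetical transition format.
Transition = Tuple[int, int, float, int]


def toy_q_fn(context: List[Transition], state: int, n_actions: int = 4) -> List[float]:
    """Hand-written stand-in for the fine-tuned model's Q-value output.

    It holds no trainable state: the result depends only on the arguments.
    """
    q = [0.0] * n_actions
    for s, a, r, _ in context:
        if s == state:
            q[a] += r  # credit each action with the reward seen so far
    return q


@dataclass
class InContextAgent:
    """Frozen policy whose behaviour changes only through its context."""
    q_fn: Callable[[List[Transition], int], List[float]]
    context: List[Transition] = field(default_factory=list)

    def act(self, state: int) -> int:
        q = self.q_fn(self.context, state)
        # Greedy choice; ties resolve to the lowest action index.
        return max(range(len(q)), key=q.__getitem__)

    def observe(self, s: int, a: int, r: float, s2: int) -> None:
        # The only thing that is ever modified is the episode history.
        self.context.append((s, a, r, s2))


def run_toy_episode(steps: int = 5) -> List[int]:
    """One-state toy task: action 2 pays +1, every other action pays -1."""
    agent = InContextAgent(q_fn=toy_q_fn)
    actions = []
    for _ in range(steps):
        a = agent.act(0)
        r = 1.0 if a == 2 else -1.0
        agent.observe(0, a, r, 0)
        actions.append(a)
    return actions


print(run_toy_episode())  # [0, 1, 2, 2, 2]
```

In this toy run the agent tries actions 0 and 1, finds the penalties in its own history, and settles on action 2, all without any parameter changing.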

The critical distinction from standard fine-tuning: ICRL doesn't teach the model to solve specific problems. It teaches the model to learn to solve problems from experience. The training objective (RL over multiple episodes with varying configurations) creates a meta-learning pressure that the transformer architecture can exploit through its context window. As discussed in "Why do trajectories matter more than individual examples for in-context learning?", ICRL's multi-episode training naturally provides the trajectory burstiness property that enables sequential decision-making ICL to emerge.
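One way to picture "RL over multiple episodes with varying configurations" is the data layout sketched here. This is an illustration under stated assumptions, not the paper's pipeline: the grid sizes, hole probability, reward of 1.0 at the goal, random behaviour policy, and the function names `sample_lake`, `rollout`, and `build_training_sequence` are all hypothetical. Each training sequence packs several episodes from the same sampled lake, so later episodes sit in a context that already contains earlier experience of that lake.

```python
import random
from typing import List, Tuple

# Action index -> (d_row, d_col): 0 left, 1 down, 2 right, 3 up.
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]
Step = Tuple[Tuple[int, int], int, float, Tuple[int, int], bool]


def sample_lake(rng: random.Random, min_size: int = 3, max_size: int = 6,
                hole_prob: float = 0.2) -> List[str]:
    """Sample one grid: S=start, G=goal, H=hole, F=frozen.

    Solvability is not checked in this sketch.
    """
    size = rng.randint(min_size, max_size)
    cells = [["H" if rng.random() < hole_prob else "F" for _ in range(size)]
             for _ in range(size)]
    start, goal = rng.sample(range(size * size), 2)  # two distinct cells
    cells[start // size][start % size] = "S"
    cells[goal // size][goal % size] = "G"
    return ["".join(row) for row in cells]


def rollout(grid: List[str], rng: random.Random, max_steps: int = 20) -> List[Step]:
    """Run one episode with a uniformly random behaviour policy."""
    size = len(grid)
    pos = next((r, c) for r, row in enumerate(grid)
               for c, ch in enumerate(row) if ch == "S")
    steps: List[Step] = []
    for _ in range(max_steps):
        action = rng.randrange(4)
        d_row, d_col = MOVES[action]
        # Moves that would leave the grid are clamped to the border.
        nxt = (min(max(pos[0] + d_row, 0), size - 1),
               min(max(pos[1] + d_col, 0), size - 1))
        cell = grid[nxt[0]][nxt[1]]
        done = cell in ("G", "H")
        steps.append((pos, action, 1.0 if cell == "G" else 0.0, nxt, done))
        pos = nxt
        if done:
            break
    return steps


def build_training_sequence(rng: random.Random, episodes_per_env: int = 4):
    """Pack several episodes from ONE sampled lake into one sequence."""
    grid = sample_lake(rng)
    sequence: List[Step] = []
    for _ in range(episodes_per_env):
        sequence.extend(rollout(grid, rng))
    return grid, sequence


grid, sequence = build_training_sequence(random.Random(0))
```

Sampling a fresh lake per sequence is what would prevent memorising a single map, while keeping the lake fixed within a sequence is what would make earlier episodes informative about later ones.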

Following "Does RL teach reasoning or just when to use it?", ICRL extends this principle: RL doesn't just teach when to reason, it teaches when and how to learn within context. The base model already has the capacity for in-context adaptation; RL post-training activates and refines this meta-learning capacity.

In line with "Do base models already contain hidden reasoning ability?", ICRL suggests that meta-learning capability may be another latent capacity that RL activates rather than creates. The pre-trained model's in-context learning ability is the substrate; RL post-training shapes it into in-context reinforcement learning.

Original note title

in-context reinforcement learning enables transformers to meta-learn from episode experience — generalizing to unseen environments without weight updates