INQUIRING LINE

Can in-context learning replicate the timing effects that RL teaches models?

This explores whether in-context learning — steering a model at inference time with examples and no weight updates — can reproduce the sequence- and phase-dependent effects that reinforcement learning bakes in during training.


The corpus splits the question into two halves that point in opposite directions. The first half is encouraging. A recurring finding is that RL doesn't teach models genuinely new reasoning; it surfaces strategies already latent in pretraining. Verifiable-reward RL acts as a catalyst that improves sampling efficiency within existing capability boundaries rather than expanding them, to the point that a single example or even spurious rewards can trigger the gains (see "What does reward learning actually do to model reasoning?" and "How does RL training reshape reasoning and what gets lost?"). If RL is mostly activation rather than construction, then much of what it 'teaches' is in principle reachable without weight updates at all: you just need the right context to call the same priors forward.

And in-context learning can in fact reach temporal, sequential behavior. The trajectory-burstiness work shows models learning sequential decision-making purely in context, but only when the context contains full or partial trajectories from the same environment, not isolated examples (see "Why do trajectories matter more than individual examples for in-context learning?"). That's a striking parallel: timing structure can be conveyed through the *shape* of the prompt, not just through gradient steps. The same flavor shows up in how instruction tuning works: what transfers is knowledge of the output space and format, not semantic task understanding, since models trained on deliberately wrong instructions perform about as well as those trained on correct ones (see "Does instruction tuning teach task understanding or output format?"). Much of what looks like 'learning' is really format and distribution activation, exactly the kind of thing a well-built context can supply.
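To make the contrast between the two prompt shapes concrete, here is a minimal sketch. It assumes a hypothetical toy format in which each step is an (observation, action, reward) tuple; the function names and the text layout are illustrative assumptions, not the format used in the trajectory-burstiness work.

```python
from typing import List, Tuple

# Hypothetical toy step format: (observation, action, reward).
Step = Tuple[str, str, float]


def format_step(step: Step) -> str:
    """Render one step as a single line of text."""
    obs, action, reward = step
    return f"obs={obs} action={action} reward={reward}"


def isolated_context(steps: List[Step]) -> str:
    """Each step stands alone, with no episode grouping or time index."""
    return "\n\n".join(format_step(s) for s in steps)


def trajectory_context(trajectories: List[List[Step]]) -> str:
    """Steps stay grouped and ordered inside the episode they came from."""
    blocks = []
    for i, traj in enumerate(trajectories):
        lines = [f"episode {i}:"]
        lines += [f"  t={t} {format_step(s)}" for t, s in enumerate(traj)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

The same steps appear in both contexts. Only the second preserves which steps belong together and in what order, which is the structural property the finding above depends on.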

The second half is where replication breaks down. RL's most interesting timing effects are properties of the *training trajectory itself*, and those have no in-context analog. RL training moves through a two-phase dynamic: first execution correctness drives gains, then strategic planning becomes the bottleneck, with planning-token entropy rising while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). The *order* in which you train domains mechanically reshapes entropy: training structured tasks before creative ones prevents entropy collapse from wrecking open-ended ability, worth a 6.2% gain over joint training (see "Does training order reshape how models handle different task types?"). These are path-dependent consolidation effects: what got consolidated first constrains what can be learned later. A prompt has no first-and-then; it presents everything at once.
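The entropy signal behind the two-phase claim can be sketched as a per-category average of next-token entropy. This is a minimal illustration, assuming each generated token already carries a label such as "planning" or "execution"; how the cited work assigns those labels is not stated here, so the labels and function names below are hypothetical.

```python
import math
from typing import Dict, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy, in nats, of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_kind(
    distributions: Sequence[Sequence[float]],
    kinds: Sequence[str],
) -> Dict[str, float]:
    """Average token entropy separately for each token category."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for probs, kind in zip(distributions, kinds):
        totals[kind] = totals.get(kind, 0.0) + token_entropy(probs)
        counts[kind] = counts.get(kind, 0) + 1
    return {kind: totals[kind] / counts[kind] for kind in totals}
```

Tracking these two averages across training checkpoints is one way to look for the pattern described above: the planning average rising while the execution average levels off.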

There's also a structural reason in-context steering can't fully impersonate RL: RL leaves a physical fingerprint. It updates only 5–30% of parameters, in sparse but near-full-rank subnetworks that are nearly identical across random seeds, a structural, repeatable surgery on the weights (see "Does reinforcement learning update only a small fraction of parameters?"). It also collapses the model onto a single dominant pretraining format within the first epoch, suppressing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). In-context learning leaves the weights untouched, so it can't make a behavior *default* the way RL does; it can only invoke it for the length of the context. Calibration is a clean example: binary-reward RL provably degrades calibration, and the fix is a second reward term, a training-time intervention with no obvious prompt equivalent (see "Does binary reward training hurt model calibration?").
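The "5–30% of parameters" figure is a statement about how many weights differ between the checkpoint before RL and the one after. A minimal sketch of that measurement follows, assuming a toy checkpoint stored as a dict of flat float lists; the tolerance parameter and the function name are illustrative assumptions, and a real comparison would run over framework tensors.

```python
from typing import Dict, List


def updated_fraction(
    before: Dict[str, List[float]],
    after: Dict[str, List[float]],
    tol: float = 0.0,
) -> float:
    """Fraction of parameters whose value moved by more than tol."""
    changed = 0
    total = 0
    for name, old_values in before.items():
        new_values = after[name]  # assumes both checkpoints share the same keys
        for old, new in zip(old_values, new_values):
            total += 1
            if abs(new - old) > tol:
                changed += 1
    return changed / total if total else 0.0
```

An in-context intervention would score 0.0 on this measure by construction, since the weights before and after the prompt are the same object.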

So the honest answer the corpus suggests: in-context learning can replicate the *capability* timing effects (surfacing latent sequential and strategic behavior, since RL was largely activating those anyway) but not the *process* timing effects. The two-phase consolidation, the order-dependent entropy dynamics, and the sparse durable rewiring are artifacts of the learning trajectory, and a context window has no trajectory to speak of. The less obvious takeaway: the question quietly conflates two different things RL does, and the corpus pulls them cleanly apart. One of them ICL can fake; the other it structurally cannot.


Sources (9 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and the reported score of 43% sits barely above a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, a sparsity that emerges without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
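The calibration point can be made concrete with a small sketch of a two-term reward. It assumes the model emits a confidence in [0, 1] alongside its answer and that the combined reward is correctness minus the squared error between confidence and correctness (the Brier penalty). That combination and the function names are assumptions for illustration; the exact form should be checked against the cited work.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: 1 for a right answer, 0 for a wrong one."""
    return 1.0 if correct else 0.0


def brier_penalty(confidence: float, correct: bool) -> float:
    """Squared gap between stated confidence and the actual outcome."""
    return (confidence - binary_reward(correct)) ** 2


def calibrated_reward(confidence: float, correct: bool) -> float:
    """Correctness reward minus the Brier penalty (assumed combination)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return binary_reward(correct) - brier_penalty(confidence, correct)
```

Under the binary reward alone, a wrong answer scores 0 whether it was stated with confidence 1.0 or 0.5. With the penalty, the same two wrong answers score -1.0 and -0.25, so confident guessing is no longer free.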

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing whether in-context learning can replicate RL's timing effects on LLMs. The question remains open: can prompt structure and context alone substitute for the learning trajectory that RL inscribes?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library of RL and ICL papers suggests:
• RL mostly activates latent strategies already in pretraining, not constructing new ones; single examples or spurious rewards trigger gains (~2025).
• In-context learning can surface sequential decision-making behavior when trajectories (full or partial) fill the context, but isolated examples fail (~2023).
• RL training exhibits two-phase dynamics: execution correctness first, then planning bottlenecks; training order reshapes entropy and constrains later learning (~2025).
• RL updates only 5–30% of parameters in sparse, repeatable subnetworks and converges to a single dominant pretraining format within epoch 1 (~2025).
• Binary-reward RL provably degrades calibration; the fix requires a second reward term baked into training, with no obvious in-context analog (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023-05): Instruction tuning teaches output-format distribution, not task understanding.
• arXiv:2505.11711 (2025-05): RL finetunes small, structured subnetworks.
• arXiv:2504.07912 (2025-04): RL post-training amplifies pretraining behaviors.
• arXiv:2507.14783 (2025-07): Multi-task RL reveals entropy dynamics tied to training order.

Your task:
(1) RE-TEST EACH CONSTRAINT. For capability timing effects (sequential, strategic behavior), probe whether newer ICL methods (retrieval-augmented prompting, learned in-context optimizers, or multi-shot trajectory composition) have since narrowed or closed the gap. For process timing effects (two-phase consolidation, order-dependent entropy, sparse rewiring), assess whether any training-free or prompt-only method has emerged that durably changes default behavior without weight updates. Plainly name what still resists in-context substitution.
(2) Surface work from the last ~6 months that contradicts or supersedes the claim that ICL cannot replicate process timing — particularly any evidence that context length, structure, or retrieval patterns can mimic order-dependent consolidation or enforce behavioral defaults across multiple turns.
(3) Propose 2 research questions that assume the regime may have moved: (a) Can chained or iterated in-context episodes (simulating a trajectory of contexts) approximate multi-phase consolidation? (b) Do longer, compositional prompts that encode task dependency graphs replicate the entropy-shaping that training order achieves?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
