INQUIRING LINE


What if you trained AI by rewarding where it focuses, rather than just grading the words it produces?

Can RL directly optimize attention distributions instead of text generation?

This explores whether reinforcement learning can treat where a model 'looks' — its attention distribution — as the thing being optimized, rather than the usual target of which tokens it emits.

The corpus has a direct answer, and it is yes: Reinforced Attention Learning treats attention patterns as the primary policy target, and on multimodal visual reasoning it beats standard token-level RLHF (see: Can optimizing attention patterns improve multimodal RL better than optimizing tokens?). The intuition is clean: attention is where the model actually allocates information and commits to a decision, so optimizing that allocation reaches the bottleneck more directly than nudging the output tokens that come downstream of it.
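To make "attention as the policy" concrete, here is a toy sketch. It is not the Reinforced Attention Learning algorithm, whose details this page does not give. It assumes a hypothetical setup in which the model attends to one of four regions, only one region (`informative`) earns a reward, and plain REINFORCE adjusts the attention logits directly. All names and hyperparameters are illustrative.

```python
import math
import random


def softmax(logits):
    """Turn raw attention logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_attention_policy(n_regions=4, informative=2, steps=500, lr=0.5, seed=0):
    """REINFORCE on attention logits (toy, hypothetical setup).

    The 'action' is which region the model looks at; the reward is 1.0
    only when it looks at the informative region.
    """
    rng = random.Random(seed)
    logits = [0.0] * n_regions
    for _ in range(steps):
        probs = softmax(logits)
        # Sample where to look from the current attention distribution.
        region = rng.choices(range(n_regions), weights=probs)[0]
        reward = 1.0 if region == informative else 0.0
        for i in range(n_regions):
            # Gradient of log softmax probability w.r.t. logit i.
            grad_log_prob = (1.0 if i == region else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


attention = train_attention_policy()
print([round(p, 3) for p in attention])
```

The reward never touches an output token here; it shapes the allocation itself, which is the design choice the paragraph above describes.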

What makes this more than a one-paper curiosity is how it rhymes with a broader finding about *where* RL does its work. When you measure what RL actually changes inside a model, it updates only 5–30% of parameters, and those sparse updates are nearly full-rank and nearly identical across random seeds, which makes them structural rather than arbitrary (see: Does reinforcement learning update only a small fraction of parameters?). That hints that RL is already implicitly concentrating its pressure on a small, decision-critical substrate. Making attention itself the explicit target is, in a sense, naming that substrate and optimizing it on purpose rather than hoping token-level rewards trickle back to it.
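One rough way to picture the "5–30% of parameters" measurement is to compare weights before and after RL and count how many moved. The sketch below assumes flattened parameter lists and a hypothetical tolerance `tol`; the cited study's exact measurement protocol is not described on this page, and the numbers are made up for illustration.

```python
def update_fraction(before, after, tol=1e-8):
    """Fraction of parameters whose value changed by more than tol."""
    if len(before) != len(after):
        raise ValueError("parameter lists must have the same length")
    if not before:
        raise ValueError("parameter lists must be non-empty")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Illustrative weights only: 2 of 10 entries differ after "training".
before = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
after = [0.10, -0.20, 0.35, 0.40, -0.50, 0.60, 0.70, -0.75, 0.90, 1.00]

fraction = update_fraction(before, after)
print(f"{fraction:.0%} of parameters updated")
```

A count like this says nothing about rank or cross-seed overlap, which are the parts of the finding that make the updates look structural.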

The corpus also reframes what 'the policy' even has to be. RL on language models is usually described as a single-turn token-prediction game, but it scales cleanly to long-horizon, multi-turn software tasks with delayed rewards (see: Can reinforcement learning scale beyond single-turn language tasks?). The reward signal itself can be swapped out too: black-box recommendation metrics like NDCG can train an LLM directly (see: Can recommendation metrics train language models directly?), and natural-language critiques can replace scalar rewards when numbers plateau, because they carry information about *why* a generation failed (see: Can natural language feedback overcome numerical reward plateaus?). Once both the reward and the horizon are this flexible, the policy target, tokens vs. attention, looks like just one more design choice rather than a fixed law.
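For a sense of what "a recommendation metric as the reward" means in practice, here is a minimal NDCG@k sketch. It assumes graded relevance labels with linear gain and a log2 position discount, which is one common form of the metric; how Rec-R1 actually computes its reward is not specified on this page, and the example rankings are invented.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of retrieved items, in the order the system ranked them.
reward_good = ndcg_at_k([3, 2, 0, 1], k=4)  # near-ideal ordering
reward_bad = ndcg_at_k([0, 1, 2, 3], k=4)   # reversed ordering
print(round(reward_good, 3), round(reward_bad, 3))
```

The scalar that comes out can be handed to an RL update as the reward for whatever generation produced the ranking, with no token-level grading involved.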

There's a cautionary thread worth knowing about too. RL doesn't just optimize; it collapses diversity, converging on a single dominant pretraining format within the first epoch and suppressing the alternatives (see: Does RL training collapse format diversity in pretrained models?). If you point that same collapsing pressure directly at attention distributions, you'd want to ask whether it sharpens the model onto the genuinely informative regions, or just narrows where it's willing to look. And from a different angle, the field is also exploring *architectural* control over attention rather than RL control: separating short-term attention from a neural memory module that decides which surprising tokens are worth storing (see: Can neural memory modules scale language models beyond attention limits?). The interesting tension the corpus leaves you with is that attention allocation can be governed two ways, learned through reward or built into the architecture, and these aren't yet talking to each other.
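If you wanted to watch for that narrowing, one simple diagnostic is the Shannon entropy of an attention distribution, tracked over training. This is a sketch under the assumption that attention weights are available as a list of non-negative numbers; it is not a method taken from any of the notes here.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of a normalized attention distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0)


broad = attention_entropy([0.25, 0.25, 0.25, 0.25])   # looks everywhere
narrow = attention_entropy([0.97, 0.01, 0.01, 0.01])  # fixated on one region
print(round(broad, 3), round(narrow, 3))
```

Falling entropy alone cannot distinguish healthy sharpening from collapse; it only flags that the distribution is narrowing, so it would need to be read alongside task performance.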

Sources (7 notes)

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only a sparse subnetwork of 5–30% of parameters, without any explicit sparsity regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training nearly doubled SWE-bench Verified performance on Qwen2.5-72B, from 20% to 39%, matching larger models. This demonstrates that RL works in stateful multi-step environments with delayed rewards and complex feedback, not only in the single-turn MDP setting.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.


Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Art of Scaling Reinforcement Learning Compute for LLMs · 2.51 match · arXiv
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining · 1.71 match · arXiv
Teaching Large Language Models to Reason with Reinforcement Learning · 1.69 match · arXiv
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs · 1.65 match · arXiv
Reinforcement Learning for Reasoning in Large Language Models with One Training Example · 1.65 match · arXiv
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training · 1.62 match · arXiv
Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning · 0.92 match · arXiv
Titans: Learning to Memorize at Test Time · 0.91 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing constraints on direct RL optimization of attention distributions. The question remains: Can RL target attention patterns as a primary policy objective, outperforming token-level reward?

What a curated library found — and when (findings span 2023–2026, dated claims, not current truth):
• Reinforced Attention Learning optimizes attention distributions directly on multimodal reasoning tasks and outperforms token-level RLHF (~2026).
• RL updates only 5–30% of model parameters; the sparse updates are nearly full-rank and nearly identical across seeds, suggesting decision-critical substrates exist (~2025).
• RL scales to long-horizon, multi-turn software engineering with delayed rewards (~2025).
• Natural-language critiques as rewards break numerical plateaus by encoding *why* failures occur (~2025).
• RL post-training converges on a single dominant pretraining distribution format within the first epoch, collapsing diversity (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2602.04884 — Reinforced Attention Learning (2026).
• arXiv:2505.11711 — Reinforcement Learning Finetunes Small Subnetworks in Large Language Models (2025).
• arXiv:2508.03501 — Training Long-Context, Multi-Turn Software Engineering Agents with RL (2025).
• arXiv:2504.07912 — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. Has the diversity collapse (single-format convergence) been mitigated by recent methods like multi-objective RL, adapter-based fine-tuning, or mixture-of-experts routing? Does directing RL at attention *prevent* collapse onto narrow attention patterns, or does it amplify it? Separate the durable question (whether attention is a valid policy target) from perishable limitations (whether diversity loss is inevitable).
(2) Surface the strongest contradicting or superseding work from the last ~6 months—especially architectural alternatives (memory modules, sparse attention) that compete with learned attention control.
(3) Propose 2 research questions assuming the regime may have moved: (a) Can hybrid objectives (attention + token logits + diversity penalties) outperform single-target RL? (b) Does attention-targeted RL generalize to out-of-distribution reasoning tasks, or does it overfit to training attention patterns?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
