SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals

Can optimizing attention patterns improve multimodal RL better than optimizing tokens?

Standard RL training optimizes token outputs in multimodal models, but the real bottleneck may be where the model attends to visual information. Does steering attention directly outperform indirect optimization through final outputs?

Synthesis note · 2026-05-18 · sourced from LLM Architecture

Standard post-training with RL improves reasoning in language models by optimizing token-level outputs. Extending the same paradigm to multimodal LLMs (MLLMs) through verbose rationales yields limited gains on perception tasks and can even degrade performance. The diagnosis in Reinforced Attention Learning (RAL) is that next-token prediction is the wrong policy objective when the actual bottleneck is information allocation in attention.

The mechanism: in MLLM architectures, visual inputs are encoded as tokens and projected into the textual embedding space. Accurate visual question-answering requires the model to identify and attend to the task-relevant visual information. That identification is the work of the attention mechanism, which assigns high weights to salient multimodal tokens. Standard RLHF optimizes the result (the output token sequence) rather than the process (the internal information allocation). The policy objective never directly targets the step where the real decision happens.
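To make "information allocation" concrete, here is a minimal sketch of scaled dot-product attention for a single query over a mixed sequence of visual and text tokens, followed by a measure of how much attention mass lands on the visual tokens. The function names, the toy vectors, and the `is_visual` mask are hypothetical illustrations, not taken from the RAL work.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product attention weights for one query vector.
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)

def visual_attention_mass(weights, is_visual):
    # Fraction of the attention distribution assigned to visual tokens.
    return sum(w for w, v in zip(weights, is_visual) if v)

# Toy sequence (assumed values): two visual tokens followed by two text tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
is_visual = [True, True, False, False]
query = [2.0, 0.0]

weights = attention_weights(query, keys)
mass = visual_attention_mass(weights, is_visual)
```

The weights form a probability distribution over input tokens, which is what allows them to be treated as a policy in their own right.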

RAL reformulates the post-training policy to operate directly on the attention distribution during generation. When a response receives high reward, the algorithm reinforces the underlying attention pattern by minimizing the divergence between the current attention and a reference. When reward is low, it pushes the model away from those sub-optimal attention patterns by increasing the divergence. Attention becomes the policy object; tokens become a downstream observable.
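One way to read the update rule above is as a divergence term whose sign is set by the reward. The sketch below is an interpretation under stated assumptions, not the published RAL objective: it assumes the divergence is a KL divergence, that the reference is an attention distribution associated with the rewarded response, and that the reward has been centered into an `advantage`. All names are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions; eps guards against log(0).
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_policy_loss(cur_attn, ref_attn, advantage):
    # Assumed form: a positive advantage makes the loss fall as the current
    # attention approaches the reference; a negative advantage makes the
    # loss fall as the current attention moves away from it.
    return advantage * kl_divergence(ref_attn, cur_attn)

# Toy attention distributions over three tokens (assumed values).
ref = [0.7, 0.2, 0.1]
near = [0.6, 0.25, 0.15]
far = [0.1, 0.2, 0.7]

loss_good_near = attention_policy_loss(near, ref, advantage=1.0)
loss_good_far = attention_policy_loss(far, ref, advantage=1.0)
loss_bad_near = attention_policy_loss(near, ref, advantage=-1.0)
loss_bad_far = attention_policy_loss(far, ref, advantage=-1.0)
```

Because a KL divergence is unbounded above, the low-reward branch would likely need clipping or a bounded divergence in practice; the note does not say how this is handled.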

This is structurally distinct from RLHF. RLHF is outcome-based RL where the gradient flows from a scalar reward through the token-generation chain. RAL is process-aware RL where the gradient flows directly to attention distributions, treating the information-allocation step as a first-class policy. The two are not interchangeable — they reinforce different aspects of the model's behavior.
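For contrast, the outcome-based side can be sketched as a standard REINFORCE-style surrogate, where the scalar reward only ever multiplies log-probabilities of output tokens. This is the generic textbook form, shown as an assumption about what "gradient flows through the token-generation chain" means here; the names are hypothetical.

```python
import math

def token_policy_loss(token_logprobs, advantage):
    # REINFORCE-style surrogate: minimizing this raises the log-probability
    # of the sampled tokens when advantage > 0 and lowers it when
    # advantage < 0. Attention weights appear nowhere in the objective;
    # they are shaped only indirectly, through backpropagation.
    return -advantage * sum(token_logprobs)

# Log-probabilities of the same three sampled tokens under two policies
# (assumed values).
confident = [math.log(0.9), math.log(0.8), math.log(0.9)]
uncertain = [math.log(0.3), math.log(0.2), math.log(0.4)]

loss_confident = token_policy_loss(confident, advantage=1.0)
loss_uncertain = token_policy_loss(uncertain, advantage=1.0)
```

The objective is defined entirely over output tokens, whereas an attention-level objective is defined over the distribution of weights on input tokens.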

The pattern generalizes. Wherever the bottleneck on a task is internal to the model rather than at the output, optimizing the output is a leaky channel for steering that bottleneck. Attention is the case here, but in principle gating decisions in MoE, retrieval choices in RAG, and tool selection in agents are all candidates for direct policy optimization rather than optimization mediated through final outputs.

Original note title

attention distributions are first-class policy optimization targets for multimodal RL — optimizing where to attend beats optimizing what to generate