What tokens do RL-trained summarizers learn to keep for ranking?
This explores what RL-trained summarizers actually preserve when their reward is a downstream ranking score — and how that connects to a broader pattern of models learning which tokens carry the real signal.
The short answer is that they stop writing for humans and start writing for the ranker. The clearest case in the corpus is ReLSum, which trains a summarizer using the actual relevance/ranking metric as its reward signal rather than asking it to produce fluent prose (see "Can reinforcement learning align summarization with ranking goals?"). The result is summaries that are dense and attribute-focused: they keep the concrete, distinguishing facts a ranker can score against (think product attributes in e-commerce search) and shed the connective, grammatical, human-readable tissue. The tokens that survive are the ones that move the ranking metric: recall, NDCG, and user engagement all improve once the summary is optimized for the ranker's score rather than for readability.
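To make "the ranking metric as reward" concrete, here is a minimal sketch of how a scalar NDCG reward could be computed from a batch of summaries. It is not ReLSum's implementation: the overlap-based ranker, the function names, and the linear-gain form of DCG are all assumptions chosen for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain (linear gain) for labels in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_reward(ranked_relevances, k=None):
    """NDCG@k of a ranking; a scalar in [0, 1] usable as an RL reward."""
    k = len(ranked_relevances) if k is None else k
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items, so nothing to reward
    return dcg(ranked_relevances[:k]) / ideal_dcg

def summary_reward(query_terms, summaries, labels, k=None):
    """Hypothetical toy ranker: score each item by how many query terms its
    summary contains, rank by that score, and return NDCG against the labels."""
    scores = [len(set(s.lower().split()) & set(query_terms)) for s in summaries]
    order = sorted(range(len(summaries)), key=lambda i: -scores[i])
    return ndcg_reward([labels[i] for i in order], k)

query = {"waterproof", "boots"}
labels = [2, 0]  # item 0 is the relevant one

# Attribute-dense summary for the relevant item: it ranks first.
dense = summary_reward(query, ["waterproof leather boots", "nice item"], labels)
# Fluent but attribute-free summary for the same item: it ranks last.
fluent = summary_reward(query, ["this is a great product", "boots"], labels)
```

Under this toy ranker the attribute-dense summary earns a higher reward than the fluent one, which is the pressure the paragraph above describes. A production setup would replace the overlap score with the real ranker.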
What makes this interesting is that the same pattern shows up when researchers crack open reasoning models token by token. Greedy likelihood-preserving pruning of reasoning chains reveals that models implicitly rank their own tokens by functional importance: symbolic-computation tokens are preserved first, while grammar and meta-discourse get pruned away earliest (see "Which tokens in reasoning chains actually matter most?"). That is much the same hierarchy ReLSum learns externally through reward: keep the load-bearing content tokens, drop the fluent filler. So 'what tokens does RL keep for ranking' has a deeper answer: the load-bearing minority, whatever the task.
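The pruning idea can be sketched as a simple greedy loop: at each step, drop whichever token costs the least in score, and record the order in which tokens disappear. This is an illustrative sketch only. `score_fn` stands in for a model's likelihood of the final answer given the pruned chain, and the integer weights below are invented to mimic "symbolic tokens matter more than filler"; the cited work's actual procedure may differ.

```python
def greedy_prune(tokens, score_fn, keep):
    """Remove tokens one at a time, always dropping the token whose removal
    leaves the highest score, until `keep` tokens remain.
    Returns (kept_tokens, removal_order)."""
    current = list(tokens)
    removed = []
    while len(current) > keep:
        # Index whose removal hurts the score least (ties: earliest index).
        best_i = max(
            range(len(current)),
            key=lambda i: score_fn(current[:i] + current[i + 1:]),
        )
        removed.append(current.pop(best_i))
    return current, removed

# Hypothetical stand-in for a likelihood: symbolic tokens carry most weight.
TOY_WEIGHTS = {"12": 10, "*": 10, "4": 10, "=": 10, "48": 10}

def toy_score(tokens):
    return sum(TOY_WEIGHTS.get(t, 1) for t in tokens)

chain = ["so", "we", "compute", "12", "*", "4", "=", "48"]
kept, removal_order = greedy_prune(chain, toy_score, keep=5)
```

With these toy weights the filler words are removed first and the arithmetic tokens survive, mirroring the ordering described above. The loop calls `score_fn` once per remaining token per step, so a real model-backed version would need batching or caching.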
That minority really is small. Work on RLVR finds that only about 20% of tokens are high-entropy 'forking points' where the model makes a real decision, and training exclusively on those matches or beats full-gradient updates (see "Do high-entropy tokens drive reasoning model improvements?"). RL doesn't spread its attention evenly across the sequence; it concentrates the learning signal on the tokens that actually change outcomes. A ranking-aligned summarizer makes a complementary move: rather than deciding which tokens to learn from, it identifies the tokens that change the downstream score and protects them.
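As a rough illustration of restricting training to forking tokens, the sketch below computes per-position entropy from next-token distributions and builds a mask over the top 20% of positions. The function names and the rank-based threshold are assumptions for illustration, not the cited work's exact recipe.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_mask(token_dists, fraction=0.2):
    """Boolean mask marking the highest-entropy `fraction` of positions."""
    ents = [entropy(d) for d in token_dists]
    n_keep = max(1, round(fraction * len(ents)))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    top = set(ranked[:n_keep])
    return [i in top for i in range(len(ents))]

def masked_mean(per_token_loss, mask):
    """Average the loss over masked positions only, so that unmasked
    (low-entropy) tokens contribute no gradient."""
    selected = [loss for loss, m in zip(per_token_loss, mask) if m]
    return sum(selected) / len(selected)

# Five positions; only the third is a genuine two-way fork.
dists = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.99, 0.01], [0.8, 0.2]]
mask = forking_mask(dists, fraction=0.2)
loss = masked_mean([0.1, 0.2, 3.0, 0.1, 0.4], mask)
```

In an actual training loop the same mask would be applied to the policy-gradient loss tensor rather than to a Python list.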
There's a sharper, more unsettling version of this too. Logit-lens analysis shows that models trained with hidden chain-of-thought tokens compute the right answer in their earliest layers and then actively overwrite it with format-compliant filler in the final layers (see "Do transformers hide reasoning before producing filler tokens?"). Which raises the flip side of your question: when we reward fluent, human-pleasing output instead of the downstream metric, we may be training models to bury the useful tokens under presentable ones. ReLSum's gain may be partly a story about removing that pressure and letting the model surface signal it would otherwise hide.
If you want to push further, the corpus also questions whether tokens are even the right unit to optimize at all. One line of work argues attention distributions are a better policy target than token sequences, because attention is where the decision allocation actually happens (see "Can optimizing attention patterns improve multimodal RL better than optimizing tokens?"). Another shows that LLM-based rankers ignore the temporal order of items in an interaction history unless explicitly prompted to care (see "Why do language models ignore temporal order in ranking?"). Together they suggest 'which tokens to keep' is only half the question; 'which tokens to weight, and in what order' is the other half.
Sources (6 notes)
ReLSum trains summarizers using downstream relevance scores as RL rewards, producing dense, attribute-focused summaries instead of fluent prose. This alignment to the actual ranking metric improves recall, NDCG, and user engagement in production e-commerce search.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Reinforced Attention Learning treats attention patterns as the primary policy target rather than token sequences. Direct optimization of information allocation shows stronger gains on visual reasoning than standard RLHF, because attention is where the actual decision happens.
LLMs can extract preferences from interaction histories but disregard temporal order by default. Recency-focused prompts and in-context examples activate latent order-sensitivity, improving ranking without retraining.