Can next-token prediction train models to optimize for communication efficiency?
This explores whether the next-token prediction objective quietly pushes models toward efficient encoding — compressing meaning into the fewest, most informative tokens — rather than just guessing the next symbol.
The corpus suggests a split answer: next-token prediction reliably produces *statistical* efficiency, but that is not the same thing as communicating meaning efficiently. Start with the fact that next-token prediction is, underneath, a compression objective. To predict well, a model has to learn where language is redundant and where it is surprising, and entropy turns out to be the currency the models actually trade in. The Byte Latent Transformer makes this explicit: it segments raw bytes into patches by next-byte entropy, spending less compute on predictable stretches and concentrating it on the surprising ones (Can byte-level models match tokenized performance with better efficiency?). That is communication-efficiency behavior, allocating effort in proportion to information content, and it is driven by the predictor's own uncertainty rather than bolted on.
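The entropy-driven segmentation described above can be sketched in a few lines. This is a toy illustration, not BLT's implementation: it assumes a bigram byte model fit on the input itself as a stand-in for a small byte-level language model, and a single global threshold. The function names `bigram_entropies` and `entropy_patches` and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def bigram_entropies(data: bytes) -> list[float]:
    """Entropy in bits of the next-byte distribution at each position.

    Assumption: a bigram model fit on `data` itself stands in for the
    small byte-level language model a real system would use.
    """
    follow = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        follow[prev][nxt] += 1

    entropies = []
    for i in range(len(data)):
        if i == 0:
            # No context for the first byte: treat it as maximally uncertain.
            entropies.append(8.0)
            continue
        counts = follow[data[i - 1]]
        total = sum(counts.values())
        entropies.append(
            sum(-(c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_patches(data: bytes, threshold: float = 1.0) -> list[bytes]:
    """Start a new patch wherever next-byte entropy exceeds `threshold`."""
    entropies = bigram_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


if __name__ == "__main__":
    for patch in entropy_patches(b"the cat sat on the mat", threshold=1.0):
        print(patch)
```

Predictable runs end up inside long patches, while positions with an uncertain next byte open new ones, so a model that spends one step per patch spends fewer steps on redundant text.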
The same entropy logic shows up from several angles. Neural memory modules decide what to store by surprise, holding onto the tokens that carry new information and letting predictable ones pass (Can neural memory modules scale language models beyond attention limits?). And when models are tuned with reinforcement learning from verifiable rewards, the learning signal turns out to live in a small high-entropy minority: roughly 20% of tokens act as the pivotal forking points, and training on those alone matches or exceeds full-gradient updates (Do high-entropy tokens drive reasoning model improvements?). Most tokens are cheap filler; a few are where the real bits are. Models even seem to *know* this internally: prune a reasoning chain greedily while preserving its likelihood, and grammar and meta-discourse go first while symbolic-computation tokens are preserved (Which tokens in reasoning chains actually matter most?). That is a learned sense of which words are load-bearing, an efficiency awareness that arguably falls out of pure prediction.
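To make the "high-entropy minority" concrete, here is a minimal sketch of picking the positions whose next-token distribution is most uncertain. It assumes the per-position distributions are already available as plain lists of probabilities; the names `token_entropy`, `high_entropy_mask`, and `keep_fraction` are hypothetical, and how such a mask would be applied to a policy-gradient loss is not shown.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy in bits of one next-token distribution."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


def high_entropy_mask(
    dists: list[list[float]], keep_fraction: float = 0.2
) -> list[bool]:
    """Mark the positions whose entropy ranks in the top `keep_fraction`.

    Assumption: the 0.2 default mirrors the roughly 20% figure quoted
    above; it is a knob for illustration, not a recommended setting.
    """
    entropies = [token_entropy(d) for d in dists]
    if not entropies:
        return []
    k = max(1, round(keep_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


if __name__ == "__main__":
    dists = [
        [1.0, 0.0],    # fully predictable
        [0.5, 0.5],    # a genuine fork
        [0.9, 0.1],
        [1.0, 0.0],
        [0.99, 0.01],
    ]
    print(high_entropy_mask(dists))
```

With five positions and `keep_fraction=0.2`, only the single most uncertain position (the 50/50 fork) is kept.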
But here's where the question gets sharper. Efficiency of *prediction* is not efficiency of *communication* in the human sense, which means conveying intent with minimal waste. Bender and Koller's argument cuts here: a model trained on form-to-form prediction has no access to the communicative intents that ground meaning, so it can compress the signal without ever optimizing for what the signal is *for* (Can language models learn meaning from text patterns alone?). The model gets ruthlessly good at the statistics of the channel while being blind to the message. So next-token prediction optimizes the *encoding*, not the *communication*.
There's even evidence that token-by-token prediction works *against* efficient meaning-packaging at the unit level. Concept-aware fine-tuning (CAFT) argues that next-token prediction fragments coherent ideas across many steps, and that switching to multi-token prediction forms more coherent semantic entities, with the LoRA variant of CAFT even outperforming full next-token fine-tuning (Can models learn multi-token concepts during fine-tuning?). Locally, predicting one token at a time is efficient; semantically, it scatters a single concept across a sequence of guesses.
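The difference between the two objectives is easiest to see in the training targets. The sketch below only builds those targets for a tokenized sequence; it is not CAFT's method (which uses self-distilled auxiliary heads), and the function names, the `k` parameter, and the example tokenization are all hypothetical.

```python
def next_token_targets(tokens: list[str]) -> list[tuple[int, str]]:
    """For each position, the single token that follows it."""
    return [(i, tokens[i + 1]) for i in range(len(tokens) - 1)]


def multi_token_targets(tokens: list[str], k: int = 3) -> list[tuple[int, tuple[str, ...]]]:
    """For each position, up to the next `k` tokens as one joint target."""
    return [
        (i, tuple(tokens[i + 1 : i + 1 + k]))
        for i in range(len(tokens) - 1)
    ]


if __name__ == "__main__":
    # Hypothetical tokenization of a single concept spread over four tokens.
    tokens = ["ribo", "nucle", "ic", "acid"]
    print(next_token_targets(tokens))
    print(multi_token_targets(tokens, k=3))
```

Under next-token targets the concept is learned as three separate one-step guesses; under multi-token targets the first position is asked for the whole remainder of the concept at once.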
The most interesting twist is that the objective is malleable. Reinforcement Pre-Training reframes next-token prediction itself as a reasoning task, drawing verifiable rewards straight from the corpus (Can next-token prediction become a reasoning task with RL?), which is evidence that you can bend what the prediction signal optimizes for. So the honest answer: next-token prediction *does* train a deep, entropy-driven efficiency, and that is a real and underappreciated result. But it is the efficiency of a compressor, not a communicator, and closing that gap takes more than predicting form.
Sources (7 notes)
The Byte Latent Transformer (BLT) dynamically segments bytes into patches based on next-byte entropy, allocating more compute to high-entropy regions and less to predictable ones. At 8B parameters, BLT matches tokenized baselines while reducing inference cost and improving robustness to typos and cross-lingual transfer.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
CAFT successfully brings multi-token prediction to post-training via self-distilled auxiliary heads, outperforming next-token fine-tuning on tasks like protein design. CAFT LoRA even outperforms full next-token fine-tuning, suggesting models learn more effectively in multi-token settings.
Reinforcement Pre-Training transforms next-token prediction into a reasoning task by providing verifiable rewards from the corpus itself, eliminating reward hacking and enabling inference-time scaling during pretraining. This suggests token-level reasoning patterns during pretraining strengthen downstream RL fine-tuning.