Does the prediction unit shape what language models actually learn?
This explores whether the *thing a model is trained to predict* — the next token, the surface form, the autoregressive step — quietly decides what kinds of knowledge and capability the model can end up with.
The corpus says it does, in ways that reach surprisingly far, rather than the unit being a neutral implementation detail. The starkest version: a model trained purely on *form-to-form* prediction has no channel through which to acquire meaning, because meaning lives in the relation between expressions and what speakers intend by them, and that relation never appears in the text stream (see "Can language models learn meaning from text patterns alone?"). The prediction unit here isn't just "a token"; it's "the next symbol given prior symbols," and that framing draws a hard boundary around what's learnable.
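To make "the next symbol given prior symbols" concrete, here is a minimal sketch of how a text stream becomes training examples. The function name `next_symbol_pairs` and the whitespace tokenization are illustrative assumptions, not something taken from the cited notes.

```python
def next_symbol_pairs(text):
    """Turn a text stream into (prior symbols, next symbol) training pairs.

    Assumption: whitespace splitting stands in for a real tokenizer.
    """
    tokens = text.split()
    # Each target is simply the token that follows its context in the stream.
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


pairs = next_symbol_pairs("the cat sat on the mat")
for context, target in pairs:
    print(context, "->", target)
```

Both the inputs and the targets come from the same stream of forms. Nothing in any pair points at a speaker's intent or at the world, which is the boundary the paragraph above describes.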
The same boundary shows up as predictable failure. If you treat an LLM as a machine that maximizes the probability of the next response, you can forecast *where* it breaks before testing it: tasks whose correct answers are low-probability under the training distribution (reciting the alphabet backwards, counting letters) stay hard even when they're logically trivial (see "Can we predict where language models will fail?"). The autoregressive objective leaves "embers" on everything the model does. A related consequence is that prompting and prompt optimization can only *reorganize* what the prediction objective already absorbed: they activate latent knowledge but cannot inject what was never in the distribution (see "Can prompt optimization teach models knowledge they lack?"). Strong parametric priors learned during training can also override the context sitting right in front of the model (see "Why do language models ignore information in their context?").
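The "low probability means hard" intuition can be illustrated with a toy model. The sketch below is an assumption-laden stand-in: a character bigram model with add-one smoothing, trained on a made-up corpus that contains only the forward alphabet. It is not an LLM and does not reproduce the cited experiments; it only shows that a sequence can be logically trivial yet score far lower under a learned distribution.

```python
import math
import string
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts


def log_prob(counts, sequence, vocab_size, alpha=1.0):
    """Log-probability of a sequence under the smoothed bigram model."""
    total = 0.0
    for prev, nxt in zip(sequence, sequence[1:]):
        numerator = counts[prev][nxt] + alpha
        denominator = sum(counts[prev].values()) + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total


alphabet = string.ascii_lowercase
corpus = (alphabet + " ") * 50  # toy training distribution: forward alphabet only
counts = train_bigram(corpus)
vocab_size = len(set(corpus))

forward = log_prob(counts, alphabet, vocab_size)
backward = log_prob(counts, alphabet[::-1], vocab_size)
print("forward:", forward, "backward:", backward)
```

The backward alphabet uses exactly the same symbols, but every one of its transitions was unseen in training, so its score is much lower than the forward one.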
Here's the turn you might not expect: because the prediction unit is so determinative, several lines of work try to *change the unit itself* to change what's learned. Post-Completion Learning keeps training on the normally discarded space after the end-of-sequence token, so the model learns to predict an evaluation of its own output, folding self-assessment into the objective at zero inference cost (see "Can models learn to evaluate their own work during training?"). Latent-Thought Language Models add a second prediction target, slow global latent variables alongside fast token-level decoding, which opens a scaling dimension independent of parameter count (see "Can latent thought vectors scale language models beyond parameters?"). And Titans changes *which* tokens get privileged: a neural memory module preferentially stores surprising ones, so what the model retains over long contexts is shaped by a surprise signal rather than uniform next-token pressure (see "Can neural memory modules scale language models beyond attention limits?").
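One way to picture "changing the unit" is as a change in which positions count toward the training loss. The following is a rough sketch of that idea for the post-completion case only. The function `loss_mask`, the `<eval>` marker, and the example sequence are hypothetical and are not taken from the Post-Completion Learning work itself.

```python
EOS = "<eos>"


def loss_mask(tokens, prompt_len, include_post_completion=False):
    """Return 1 for positions that contribute to the training loss, else 0.

    Assumption: the first prompt_len tokens are the prompt and are never scored.
    """
    eos_index = tokens.index(EOS)
    mask = []
    for i in range(len(tokens)):
        if i < prompt_len:
            mask.append(0)  # prompt: not a prediction target
        elif i <= eos_index:
            mask.append(1)  # answer, up to and including <eos>
        else:
            # Normally discarded; scored only when the unit is widened.
            mask.append(1 if include_post_completion else 0)
    return mask


# Hypothetical training sequence with a self-evaluation after <eos>.
tokens = ["Q:", "2+2?", "A:", "4", EOS, "<eval>", "correct"]
standard = loss_mask(tokens, prompt_len=3)
widened = loss_mask(tokens, prompt_len=3, include_post_completion=True)
print(standard)
print(widened)
```

Because generation still stops at `<eos>`, the extra scored positions exist only during training, which is consistent with the zero-inference-cost claim above.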
The through-line worth carrying away: the prediction unit doesn't just train a model, it *defines the shape of the box* the model lives in. It explains why LLMs sample a character from a superposition rather than committing to one: generation is a draw from a distribution, not a retrieval of a fixed state (see "Do large language models actually commit to a single character?"). So the most interesting frontier in the collection isn't squeezing more from next-token prediction; it's quietly redefining what "the next thing to predict" even is.
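The "draw from a distribution, not a retrieval" point can be sketched in a few lines. The candidate objects and their probabilities below are invented for illustration: think of them as the objects still consistent with the answers given so far in a 20-questions game.

```python
import random


def sample_next(distribution, rng):
    """Draw one outcome in proportion to its probability."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


# Hypothetical objects still consistent with the conversation so far.
candidates = {"apple": 0.4, "pear": 0.35, "plum": 0.25}

rng = random.Random(0)  # seeded only so the sketch is repeatable
draws = [sample_next(candidates, rng) for _ in range(200)]
distinct = set(draws)
print(distinct)
```

Regenerating the answer repeatedly yields different objects, each one consistent with the prior context. No single object was stored and then looked up.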
Sources (8 notes)
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions for tasks such as reciting the alphabet backwards and counting letters.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Shanahan's 20-questions test shows that LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that the model has made no fixed commitment.