INQUIRING LINE

Does the prediction unit shape what language models actually learn?

This explores whether the *thing a model is trained to predict* — the next token, the surface form, the autoregressive step — quietly decides what kinds of knowledge and capability the model can end up with.


The *unit of prediction* (what the model is asked to guess, token by token, left to right) is not a neutral implementation detail: according to the corpus, it shapes what the model can actually learn, in ways that reach surprisingly far. The starkest version: a model trained purely on *form-to-form* prediction has no channel through which to acquire meaning, because meaning lives in the relation between expressions and what speakers intend by them, and that relation never appears in the text stream (see "Can language models learn meaning from text patterns alone?"). The prediction unit here isn't just "a token"; it's "the next symbol given prior symbols," and that framing draws a hard boundary around what's learnable.
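To make "the next symbol given prior symbols" concrete, the standard autoregressive factorization and its training loss can be written as below. The notation is an assumption introduced here for illustration (tokens $x_1, \dots, x_T$, model parameters $\theta$), not taken from the notes themselves.

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```

Every quantity on either side is a symbol from the text stream. Nothing in the objective refers to a speaker's intent, which is the sense in which the form-to-form argument says the relation that carries meaning never appears.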

The same boundary shows up as predictable failure. If you treat an LLM as a machine that maximizes the probability of the next response, you can forecast *where* it breaks before testing it: tasks whose correct answers are low-probability under the training distribution, such as reciting the alphabet backwards or counting letters, stay hard even when they're logically trivial (see "Can we predict where language models will fail?"). The autoregressive objective leaves "embers" on everything the model does. A related consequence is that prompting and prompt optimization can only *reorganize* what the prediction objective already absorbed: they activate latent knowledge but cannot inject what was never in the distribution (see "Can prompt optimization teach models knowledge they lack?"). Strong parametric priors learned during training can also override the context sitting right in front of the model (see "Why do language models ignore information in their context?").
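The low-probability-target idea can be illustrated with a deliberately tiny stand-in for a language model. The sketch below is not the cited experiment: it assumes a character-level bigram model with add-one smoothing, trained on a hypothetical corpus that only ever shows the alphabet in forward order, and compares the log-probability it assigns to the forward and backward alphabet.

```python
import math
import string
from collections import Counter


def train_bigram(corpus):
    """Fit a character bigram model and return a smoothed log-prob function."""
    vocab_size = len(set(corpus))
    pair_counts = Counter(zip(corpus, corpus[1:]))
    prev_counts = Counter(corpus[:-1])

    def log_prob(prev, nxt):
        # Add-one (Laplace) smoothing so unseen transitions keep nonzero mass.
        return math.log((pair_counts[(prev, nxt)] + 1) / (prev_counts[prev] + vocab_size))

    return log_prob


def sequence_log_prob(log_prob, seq):
    """Sum of log p(next | prev) over the sequence (first character not scored)."""
    return sum(log_prob(a, b) for a, b in zip(seq, seq[1:]))


alphabet = string.ascii_lowercase
# Hypothetical training data: the forward alphabet, repeated.
corpus = (alphabet + " ") * 50

log_prob = train_bigram(corpus)
forward = sequence_log_prob(log_prob, alphabet)
backward = sequence_log_prob(log_prob, alphabet[::-1])

print(f"forward:  {forward:.1f}")
print(f"backward: {backward:.1f}")
```

Both sequences are equally simple to describe, but the backward one is built entirely from transitions the model never saw, so it scores far lower. That is the toy analogue of a task being logically trivial yet low-probability under the training distribution.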

Here's the turn you might not expect: because the prediction unit is so determinative, several lines of work try to *change the unit itself* to change what's learned. Post-Completion Learning keeps training on the normally discarded space after the end-of-sequence token, so the model learns to predict an evaluation of its own output, folding self-assessment into the objective at zero inference cost (see "Can models learn to evaluate their own work during training?"). Latent-Thought Language Models add latent thought vectors as a second thing to infer: the latents are fitted by fast local variational learning while the global decoder is learned slowly, which opens a scaling dimension (latent size) independent of parameter count (see "Can latent thought vectors scale language models beyond parameters?"). And Titans changes *which* tokens get privileged: a neural memory module preferentially stores surprising ones, so what the model retains over long contexts is shaped by a surprise signal rather than uniform next-token pressure (see "Can neural memory modules scale language models beyond attention limits?").
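One way to picture "training on the space after the end-of-sequence token" is as a change to the loss mask. The sketch below is an illustration under stated assumptions, not the Post-Completion Learning implementation: the token strings, the `<eval>` and `<end>` markers, and the `loss_mask` helper are all hypothetical.

```python
def loss_mask(tokens, prompt_len, eos="<eos>", include_post_completion=False):
    """Return a 0/1 list marking which positions contribute to the training loss.

    Assumes exactly one EOS at or after position `prompt_len`.
    """
    eos_index = tokens.index(eos, prompt_len)
    mask = []
    for i in range(len(tokens)):
        if i < prompt_len:
            mask.append(0)  # prompt tokens are conditioning only
        elif i <= eos_index:
            mask.append(1)  # answer tokens and the EOS itself
        else:
            # Post-EOS positions: normally discarded, optionally trained on.
            mask.append(1 if include_post_completion else 0)
    return mask


# Hypothetical training sequence: prompt, answer, EOS, then a self-evaluation.
tokens = ["Q:", "2+2", "=", "?", "A:", "4", "<eos>", "<eval>", "correct", "<end>"]

standard = loss_mask(tokens, prompt_len=5)
extended = loss_mask(tokens, prompt_len=5, include_post_completion=True)

print(standard)  # [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
print(extended)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

At inference time generation still stops at `<eos>`, so the extra positions add training signal without adding decoding cost, which is the property the note describes.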

The through-line worth carrying away: the prediction unit doesn't just train a model, it *defines the shape of the box* the model lives in. It explains why LLMs sample a character from a superposition rather than committing to one: generation is a draw from a distribution, not a retrieval of a fixed state (see "Do large language models actually commit to a single character?"). So the most interesting frontier in the collection isn't squeezing more from next-token prediction; it's quietly redefining what "the next thing to predict" even is.
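The "draw from a distribution" point can be shown with a minimal sampling sketch. It assumes a hypothetical three-way choice with made-up logits, and uses only the Python standard library; real decoders operate over a full vocabulary, but the mechanism is the same.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, with a temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(candidates, logits, rng, temperature=1.0):
    """Draw one candidate according to the softmax distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical candidates, all consistent with the conversation so far.
candidates = ["cat", "dog", "owl"]
logits = [1.0, 0.8, 0.5]

# "Regenerating" the answer: same context, different random draws.
draws = [sample(candidates, logits, random.Random(seed)) for seed in range(200)]
distinct = set(draws)

print(sorted(distinct))
```

The context (here, the logits) never changes between regenerations, yet the outputs do. Nothing in the model has to have "picked" one candidate in advance; the choice is made only at the moment of sampling.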


Sources (8 notes)

Can language models learn meaning from text patterns alone?

Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits the unused sequence space after the model's output to teach self-assessment during training, at zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing whether the prediction unit (what a model is asked to guess, token by token, left to right) fundamentally constrains what it learns. The claims below come from a curated library whose notes date from roughly 2023 to 2025.

What a curated library found — and when (dated claims, not current truth):
• Form-to-form prediction alone cannot acquire meaning because meaning requires speaker intent, absent from text streams (2023–2024).
• Autoregressive objectives leave predictable failure signatures: tasks low-probability under training (alphabet backwards, letter counting) remain hard even when logically trivial (2024).
• Prompt optimization only *activates* latent knowledge; it cannot inject what wasn't in the training distribution (2024–2025).
• Changing the prediction unit itself reshapes learning: Post-Completion Learning folds self-assessment into the objective via post-EOS space (2025); Latent-Thought Language Models add latent thought vectors, trained with fast local variational learning and slow global decoder learning, as a scaling dimension independent of parameter count (2025); Titans uses surprise signals to preferentially memorize over uniform next-token pressure (2024–2025).
• LLM generation is a draw from a distribution, not retrieval of a committed internal state (2024).

Anchor papers (verify; mind their dates):
• arXiv:2305.10601 (Tree of Thoughts, 2023)
• arXiv:2501.00663 (Titans, 2024)
• arXiv:2502.01567 (Latent-Thought LMs, 2025)
• arXiv:2507.20252 (Post-Completion Learning, 2025)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether newer models (o1, gpt-4o, Claude 4), training methods (RL from human feedback, constitutional AI, self-play), or orchestration (chain-of-thought, multi-turn memory, tool use) have since RELAXED or OVERTURNED it. Separate the durable question (e.g., "does the objective shape representable knowledge?") from the perishable limitation (e.g., "next-token prediction alone cannot do X"). State plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—papers showing that prediction units matter less than the library suggests, or that architectural/training changes make the unit nearly irrelevant.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If latent-thought scaling is real, does it emerge naturally during standard training or only under explicit architectural design?" or "Can multi-turn, context-dependent prediction units (not left-to-right) relax the meaning-acquisition barrier?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
