SYNTHESIS NOTE
Model Architecture and Internals

Why is predicting latents more sample-efficient than tokens?

Explores whether learning from a network's own abstract representations requires far fewer training samples than learning from raw tokens, and what mechanism drives this efficiency gap.

Synthesis note · 2026-06-03 · sourced from Training Fine Tuning

Generative models reach striking performance but pay a data cost that biological learners do not: frontier LLMs train on 10¹³–10¹⁴ tokens, five or more orders of magnitude beyond what a child encounters. One hypothesis for closing that gap is that learning is most efficient not at the level of raw tokens but in a more abstract latent space, as in data2vec and JEPA, which predict a network's own latent representations of related views or masked regions.

The source paper gives that intuition a quantitative footing. Using a tractable probabilistic context-free grammar (a random hierarchy model of depth L that captures compositional structure in language and images), it proves a sharp separation: supervised learning or token-level self-supervision needs a number of samples exponential in the depth L to recover the latent tree, while latent prediction recovers it with a number of samples constant in L, up to logarithmic factors (concretely, scaling as m³ for latent prediction versus m^{L+1} for token-level learning). The result is confirmed with a hierarchical clustering algorithm, an end-to-end predictor-clusterer network, and the first sample-complexity analysis of data2vec.
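To get a feel for the size of that gap, the two scalings can simply be evaluated side by side. The sketch below is illustrative arithmetic only: the value m = 8 and the chosen depths are assumptions picked for the example, the function name is hypothetical, and the constants and logarithmic factors mentioned above are dropped.

```python
def sample_scaling(m: int, depth: int) -> tuple[int, int]:
    """Return (latent, token) sample-count scalings quoted in the note.

    latent: m**3            (constant in depth, log factors ignored)
    token:  m**(depth + 1)  (exponential in depth)
    """
    latent = m ** 3
    token = m ** (depth + 1)
    return latent, token


# m = 8 is an assumed, illustrative value; it is not taken from the source.
for depth in (2, 3, 4, 6):
    latent, token = sample_scaling(m=8, depth=depth)
    print(f"L={depth}: latent ~ {latent:,}  token ~ {token:,}  ratio ~ {token // latent:,}")
```

Under these scalings the ratio between the two is m^{L-2}, so the two approaches coincide at L = 2 and the gap grows by a factor of m with every additional level of the hierarchy.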

The mechanism is the key takeaway: latents at the same level of the hierarchy are far more correlated with each other than they are with raw tokens, so predicting from one's own latents amplifies a signal that token-level prediction dilutes. This connects to "Do base models already contain hidden reasoning ability?", since both point to the representations the network already forms as the efficient locus of learning, and it grounds the empirical promise behind world-model work such as "Can a single regularizer prevent JEPA representation collapse?".
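The dilution effect can be seen in a deliberately simplified toy, which is not the paper's full random hierarchy model. The sketch assumes two same-level latents with a random joint distribution, and assumes each latent value emits one of m synonymous tokens uniformly at random; all names and parameter values are hypothetical. It computes exact covariances between one-hot indicators, so no sampling is involved.

```python
import numpy as np


def indicator_cov(joint: np.ndarray) -> np.ndarray:
    """Covariance of one-hot indicators: P(a, b) - P(a) * P(b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    return joint - pa * pb


def toy_correlations(v: int = 4, m: int = 3, seed: int = 0) -> dict[str, float]:
    """Largest absolute covariance for latent-latent, latent-token, token-token pairs."""
    rng = np.random.default_rng(seed)
    # Assumed toy: random joint distribution over two same-level latents (z1, z2).
    p_zz = rng.random((v, v))
    p_zz /= p_zz.sum()
    # emit[z, x] = P(x | z): each latent owns m synonymous tokens, chosen uniformly.
    emit = np.kron(np.eye(v), np.full((1, m), 1.0 / m))  # shape (v, v * m)
    p_zx = p_zz @ emit           # joint of (z1, x2)
    p_xx = emit.T @ p_zz @ emit  # joint of (x1, x2)
    return {
        "latent-latent": float(np.abs(indicator_cov(p_zz)).max()),
        "latent-token": float(np.abs(indicator_cov(p_zx)).max()),
        "token-token": float(np.abs(indicator_cov(p_xx)).max()),
    }


result = toy_correlations(v=4, m=3)
for pair, value in result.items():
    print(f"{pair}: {value:.5f}")
```

In this toy the latent-token covariance is smaller than the latent-latent one by a factor of m, and the token-token covariance by a factor of m², because the synonym choice spreads the same dependence over more outcomes. A weaker signal of this kind generally takes more samples to detect, which is the intuition behind the separation described above.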

Original note title

predicting your own latents is exponentially more sample-efficient than token-level prediction because same-level latents are far more correlated than tokens