Why is predicting latents more sample-efficient than tokens?
Explores whether learning from a network's own abstract representations requires far fewer training samples than learning from raw tokens, and what mechanism drives this efficiency gap.
Generative models reach striking performance but pay a data cost that biological learners do not: frontier LLMs train on 10¹³–10¹⁴ tokens, five or more orders of magnitude beyond what a child encounters. One hypothesis for closing that gap is that learning is most efficient not at the level of raw tokens but in a more abstract latent space, as in data2vec and JEPA, which predict a network's own latent representations of related views or masked regions.
This paper gives that intuition a quantitative footing. Using a tractable probabilistic context-free grammar (a random hierarchy model of depth L that captures compositional structure in language and images), it proves a sharp separation: supervised learning or token-level self-supervision needs a number of samples exponential in L to recover the latent tree, while latent prediction recovers it with a number of samples constant in L, up to logarithmic factors (concretely, scaling as m^{L+1} for token-level learning versus m^3 for latent prediction). The result is confirmed with a hierarchical clustering algorithm, an end-to-end predictor-clusterer network, and the first sample-complexity analysis of data2vec.
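To make the gap concrete, the two quoted scalings can be tabulated for a few depths. This is a minimal sketch that treats m as a free grammar parameter and ignores constants and logarithmic factors; the function names and the value m = 8 are illustrative assumptions, not taken from the paper.

```python
def token_level_samples(m: int, depth: int) -> int:
    """Leading-order scaling quoted for token-level learning: m^(L+1)."""
    return m ** (depth + 1)


def latent_level_samples(m: int) -> int:
    """Leading-order scaling quoted for latent prediction: m^3, independent of depth."""
    return m ** 3


def scaling_gap(m: int, depth: int) -> float:
    """Ratio of the two scalings, which equals m^(L-2)."""
    return token_level_samples(m, depth) / latent_level_samples(m)


if __name__ == "__main__":
    m = 8  # illustrative value only (assumption)
    for depth in (2, 3, 4, 6):
        print(
            f"L={depth}: token-level ~ {token_level_samples(m, depth)}, "
            f"latent-level ~ {latent_level_samples(m)}, "
            f"ratio ~ {scaling_gap(m, depth):.0f}"
        )
```

At depth 2 the two scalings coincide (both are m^3); each additional level multiplies the token-level figure by m while the latent-level figure stays fixed.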
The mechanism is the key takeaway: latents at the same level of the hierarchy are far more correlated with each other than they are with raw tokens, so predicting from one's own latents amplifies a signal that token-level prediction dilutes. This connects to "Do base models already contain hidden reasoning ability?": both point to the representations a network already forms as the efficient locus of learning. It also grounds the empirical promise behind world-model work such as "Can a single regularizer prevent JEPA representation collapse?".
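The direction of this effect can be checked exactly on a toy two-level hierarchy. The sketch below is not the paper's construction: the vocabulary size v, the number of rules per symbol m, binary branching, uniform rule choice, and all function names are assumptions made for illustration. It computes, without sampling, the mutual information between two sibling latents and between one latent and a token generated under its sibling.

```python
import itertools
import math
import random
from collections import defaultdict


def make_rules(v, m, rng):
    """Give each of v symbols m distinct child pairs; no pair is shared between symbols."""
    pairs = list(itertools.product(range(v), repeat=2))
    chosen = rng.sample(pairs, v * m)  # requires m <= v
    return {sym: chosen[sym * m:(sym + 1) * m] for sym in range(v)}


def mutual_information(joint):
    """Mutual information in bits of a joint distribution {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(
        p * math.log2(p / (pa[a] * pb[b]))
        for (a, b), p in joint.items()
        if p > 0
    )


def latent_vs_token_mi(v=4, m=2, seed=0):
    """Return (I(h1; h2), I(h1; x)) where x is the left token generated under h2."""
    rng = random.Random(seed)
    top_rules = make_rules(v, m, rng)  # root symbol -> (h1, h2)
    low_rules = make_rules(v, m, rng)  # latent symbol -> (token, token)

    # Exact joint of the two sibling latents: uniform root, uniform rule.
    latent_joint = defaultdict(float)
    for root in range(v):
        for h1, h2 in top_rules[root]:
            latent_joint[(h1, h2)] += 1.0 / (v * m)

    # Exact joint of h1 with a token that depends on h1 only through h2.
    token_joint = defaultdict(float)
    for (h1, h2), p in latent_joint.items():
        for x_left, _ in low_rules[h2]:
            token_joint[(h1, x_left)] += p / m

    return mutual_information(latent_joint), mutual_information(token_joint)


if __name__ == "__main__":
    mi_latent, mi_token = latent_vs_token_mi()
    print(f"I(latent; latent) = {mi_latent:.3f} bits")
    print(f"I(latent; token)  = {mi_token:.3f} bits")
```

In this toy, the latent-latent value can never fall below the latent-token value, because the token is generated from the sibling latent alone (the data-processing inequality). The size of the gap depends on the sampled rules and is not meant to reproduce the paper's quantitative result.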
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries.
- How can neural networks be interpretable by design rather than post-hoc?
- Why does recursion on latent state drive generalization better than hierarchy?
- Why is latent-level prediction more sample-efficient than token-level prediction?
- Can predictive self-supervision work on unlabeled sequential visual data?
- What physical structure does a Gaussian-regularized latent space actually encode?
- Do discrete tokenized modalities preserve information better than continuous embeddings?
- How does upward distillation transfer knowledge from smaller to larger networks?
- What makes a feature abstract versus concrete in neural network activations?
- Why does masking the penultimate token outperform random token masking?
- What makes looped latent computation more efficient than scaling attention capacity?
- Why does attending to own latents work better than bolted-on external memory stores?
- Why does latent-level prediction beat token-level prediction for reasoning?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- How does representational density emerge from training data familiarity?
- Can training order and structure shape what networks retain and learn?
- Can data pruning and equal contribution be reconciled in optimal learning?
- How do latents at the same hierarchy level become more correlated than tokens?
- Does latent density emerge during pretraining from training data familiarity?
- What prevents representation collapse in latent-prediction world models like JEPA?
- What makes representation interventions more efficient than weight perturbations for finetuning?
- What are the concrete efficiency gains of linear-attention state-space models?
- Should user context live in tokens or in learned model representations?
Related concepts in this collection (3)
- Can a single regularizer prevent JEPA representation collapse?
  JEPAs traditionally need complex loss stacks and auxiliary tricks to avoid collapse. Can a single Gaussian-distribution constraint on latent embeddings do the same stabilization work, and would that simplify training?
  the theory behind why latent-prediction world models work; LeWM is the practical instantiation
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  both argue the network's own representations are the efficient learning target
- Is representational sparsity learned or intrinsic to neural networks?
  Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.
  adjacent account of how latent structure forms during training
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Learn from your own latents and not from tokens: A sample-complexity theory
- Break It Down: Evidence for Structural Compositionality in Neural Networks
- Reasoning to Learn from Latent Thoughts
- Scalable Language Models with Posterior Inference of Latent Thought Vectors
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Nested Learning: The Illusion of Deep Learning Architectures
- Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Original note title
predicting your own latents is exponentially more sample-efficient than token-level prediction because same-level latents are far more correlated than tokens