SYNTHESIS NOTE

Why is predicting latents more sample-efficient than tokens?

Explores whether learning from a network's own abstract representations requires far fewer training samples than learning from raw tokens, and what mechanism drives this efficiency gap.

Synthesis note · 2026-06-03 · sourced from Training Fine Tuning

Generative models reach striking performance but pay a data cost that biological learners do not: frontier LLMs train on 10¹³–10¹⁴ tokens, five or more orders of magnitude beyond what a child encounters. One hypothesis for closing that gap is that learning is most efficient not at the level of raw tokens but in a more abstract latent space, as in data2vec and JEPA, which predict a network's own latent representations of related views or masked regions.

The source paper gives that intuition a quantitative footing. Using a tractable probabilistic context-free grammar (a random hierarchy model of depth L that captures compositional structure in language and images), it proves a sharp separation: supervised learning or token-level self-supervision needs a number of samples exponential in the depth L to recover the latent tree, while latent prediction recovers it with a number of samples constant in L, up to logarithmic factors. Concretely, the sample cost scales as m^{L+1} for token-level learning versus m³ for latent prediction. The result is confirmed with a hierarchical clustering algorithm, an end-to-end predictor-clusterer network, and the first sample-complexity analysis of data2vec.
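
To get a feel for the size of this gap, the two scaling laws can be evaluated directly. The sketch below is illustrative only: it drops constants and logarithmic factors, the function names are hypothetical, and m is simply the grammar parameter in which the note states its scalings (the value used here is an assumption).

```python
def token_level_samples(m: int, depth: int) -> int:
    """Leading-order sample cost of token-level learning: m^(L+1)."""
    return m ** (depth + 1)


def latent_prediction_samples(m: int) -> int:
    """Leading-order sample cost of latent prediction: m^3, independent of depth."""
    return m ** 3


if __name__ == "__main__":
    m = 10  # assumed value, chosen only for illustration
    for depth in (2, 4, 6, 8):
        tokens = token_level_samples(m, depth)
        latents = latent_prediction_samples(m)
        print(
            f"L={depth}: token-level ~{tokens:.0e}, "
            f"latent ~{latents:.0e}, ratio ~{tokens // latents:.0e}"
        )
```

Under these leading-order expressions the two costs coincide at L = 2, and every additional level multiplies the token-level cost by m while the latent-prediction cost stays fixed.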

The mechanism is the key takeaway: latents at the same level of the hierarchy are far more correlated with each other than they are with raw tokens, so predicting from one's own latents amplifies a signal that token-level prediction dilutes. This connects to "Do base models already contain hidden reasoning ability?": both notes point to the representations the network already forms as the efficient locus of learning. It also grounds the empirical promise behind world-model work such as "Can a single regularizer prevent JEPA representation collapse?"
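
One way to see why the signal weakens with distance in the hierarchy is a deliberately simplified toy, which is not the paper's random hierarchy model: ±1 variables on a tree where each edge flips its parent's value with probability p. In that assumed setting, the correlation between two nodes is (1 - 2p) raised to the number of edges on the path between them, so each extra hop down toward the raw tokens shrinks the signal. The function name and the value of p are hypothetical.

```python
def tree_correlation(hops: int, flip_prob: float) -> float:
    """Correlation of two +/-1 nodes joined by `hops` noisy tree edges.

    Assumes a uniform root and an independent symmetric flip on every edge,
    so the correlation is the product of (1 - 2p) over the connecting path.
    """
    return (1.0 - 2.0 * flip_prob) ** hops


if __name__ == "__main__":
    p = 0.3  # assumed flip probability, for illustration only
    # Sibling latents: latent -> shared parent -> latent (2 edges).
    print("latent-latent:", tree_correlation(2, p))
    # A latent and its sibling's token: one extra edge down to the token (3 edges).
    print("latent-token: ", tree_correlation(3, p))
    # Two tokens under sibling latents (4 edges).
    print("token-token:  ", tree_correlation(4, p))
```

In this toy, the latent-to-latent correlation is always the largest of the three whenever 0 < p < 0.5, which mirrors the qualitative claim above without reproducing the paper's quantitative analysis.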

Inquiring lines that read this note 29

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What limits mechanistic interpretability's ability to characterize models?

How does latent reasoning compare to verbalized chain-of-thought?

Why does recursion on latent state drive generalization better than hierarchy?

Can next-token prediction alone produce genuine language understanding?

Can self-supervised signals enable process supervision without human annotation?

Can predictive self-supervision work on unlabeled sequential visual data?

What articulatory information do speech signals carry that text cannot?

Do discrete tokenized modalities preserve information better than continuous embeddings?

What makes weaker teacher models effective for stronger student training?

How does upward distillation transfer knowledge from smaller to larger networks?

How does sequence length affect sparsity tolerance in models?

What memory architectures best support persistent reasoning across extended interactions?

Why does attending to own latents work better than bolted-on external memory stores?

When does architectural design matter more than raw model capacity?

Can a two-layer network outgeneralize billion-parameter models through recursion alone?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

Why does finetuning cause catastrophic forgetting of model capabilities?

What makes representation interventions more efficient than weight perturbations for finetuning?

How do training priors constrain what context information can override?

Should user context live in tokens or in learned model representations?

Why do semantic similarity and task relevance diverge in vector embeddings?

Can generative reconstruction preserve latent manifold structure better than geometric compression?

Do language models learn genuine linguistic structure or just surface patterns?

Can we balance interpretability with the efficiency gains of compressed inter-model communication?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Can latent recurrence achieve the depth that standard transformers cannot?

How do policy learning algorithm choices affect multi-objective optimization stability?

Can trust region constraints prevent the sample inefficiency problems of RLHF?

Related concepts in this collection 3

Can a single regularizer prevent JEPA representation collapse?
JEPAs traditionally need complex loss stacks and auxiliary tricks to avoid collapse. Can a single Gaussian-distribution constraint on latent embeddings do the same stabilization work, and would that simplify training?
the theory behind why latent-prediction world models work; LeWM is the practical instantiation

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
both argue the network's own representations are the efficient learning target

Is representational sparsity learned or intrinsic to neural networks?
Explores whether sparsity in neural network activations is engineered through training or emerges as a default response to unfamiliar inputs. Understanding this distinction could reshape how we design and interpret model behavior.
adjacent account of how latent structure forms during training
