How do parameter scaling and latent vectors interact in language models?
This note explores whether you can make a language model more capable by scaling something other than its parameter count, specifically latent vectors (compact internal representations the model reasons over), and how those two levers interact. The short version from the corpus: latent vectors open up scaling dimensions that parameters alone don't, but they come with their own ceilings, and raw parameter count turns out to be a surprisingly blunt instrument.
The clearest demonstration that latent size is its own scaling axis comes from Latent-Thought Language Models, which split learning into two clocks: fast local learning over latent thought vectors and slow global learning of the decoder (see "Can latent thought vectors scale language models beyond parameters?"). Few-shot reasoning improves as you scale *either* model size or latent size, meaning you can buy capability by giving the model more room to think rather than more weights. That reframes scaling as multi-dimensional rather than a single parameter-count dial.
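To make the two-clock idea concrete, here is a minimal toy sketch. It is not the Latent-Thought Language Model training procedure (which the summary describes as variational); it only illustrates the structure of an inner fast loop that fits a per-example latent while the decoder is frozen, wrapped in an outer slow loop that nudges the shared decoder. The linear decoder, the scalar latent, the step sizes, and the names `infer_latent`, `mean_loss`, and `train` are all assumptions made for illustration.

```python
# Toy two-clock training loop (illustrative only, not the published method).
# Decoder: x_hat = z * w, where w is a 2-D weight vector shared by all
# examples and z is a scalar latent inferred separately for each example.

def infer_latent(x, w, fast_lr=0.05, fast_steps=30, prior_weight=0.01):
    """Fast clock: fit one example's latent by gradient descent, decoder frozen."""
    z = 0.0  # start from the prior mean
    for _ in range(fast_steps):
        residual = [xi - z * wi for xi, wi in zip(x, w)]
        # Gradient of squared error plus a small quadratic prior on z.
        grad_z = -2.0 * sum(r * wi for r, wi in zip(residual, w)) + prior_weight * z
        z -= fast_lr * grad_z
    return z

def mean_loss(data, w):
    """Mean squared reconstruction error after per-example latent inference."""
    total = 0.0
    for x in data:
        z = infer_latent(x, w)
        total += sum((xi - z * wi) ** 2 for xi, wi in zip(x, w))
    return total / len(data)

def train(data, w, slow_lr=0.05, slow_steps=50):
    """Slow clock: one small decoder update per pass over the data."""
    w = list(w)
    for _ in range(slow_steps):
        grad_w = [0.0, 0.0]
        for x in data:
            z = infer_latent(x, w)  # latents are re-inferred on every pass
            for j in range(2):
                grad_w[j] += -2.0 * (x[j] - z * w[j]) * z / len(data)
        w = [wj - slow_lr * gj for wj, gj in zip(w, grad_w)]
    return w

# Synthetic data lying along the direction (1, 2); the decoder starts misaligned.
data = [[t, 2.0 * t] for t in (-1.0, 0.5, 1.0, 2.0)]
w_init = [1.0, 0.0]
initial_loss = mean_loss(data, w_init)
w_final = train(data, w_init)
final_loss = mean_loss(data, w_final)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this toy, capacity can be added on either side: a larger decoder (more entries in `w`) or a richer latent (more than one scalar per example), which is the sense in which the two act as separate dials.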
The corpus also keeps puncturing the assumption that more parameters are the answer. Tiny models do better with depth than width: MobileLLM composes abstract concepts through layers instead of spreading weight across width, which suggests that at the 125M–350M scale how parameters are *arranged* matters, not just how many there are (see "Does depth matter more than width for tiny language models?"). And scale isn't a single lever for some skills: pretraining scale drives factual knowledge in lower layers while fine-tuning scale drives helpful behavior in upper layers, so the two decouple cleanly (see "Do pretraining and fine-tuning scale independently in language models?"). On hard problems, parameters hit a wall entirely: models plateau at ~55–60% constraint satisfaction on constrained-optimization tasks regardless of size (see "Do larger language models solve constrained optimization better?").
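The depth-versus-width point is easier to see with a back-of-the-envelope count showing that two very different shapes can spend nearly the same budget. The sketch below uses the common approximation of about 12·d² weights per standard Transformer block (roughly 4·d² for the attention projections and 8·d² for a feed-forward block with 4x expansion), ignoring embeddings, biases, and normalization. The two configurations are hypothetical examples chosen to match in size, not MobileLLM's published configurations.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a standard Transformer stack.

    Assumes ~4*d^2 weights for attention (Q, K, V, output projections) and
    ~8*d^2 for a feed-forward block with 4x expansion, per layer.
    """
    return n_layers * 12 * d_model ** 2

# Hypothetical shapes with a near-identical budget.
deep_thin = approx_block_params(n_layers=30, d_model=576)
shallow_wide = approx_block_params(n_layers=12, d_model=912)
relative_gap = abs(deep_thin - shallow_wide) / deep_thin

print(f"deep and thin:    {deep_thin:,} parameters")
print(f"shallow and wide: {shallow_wide:,} parameters")
print(f"relative gap:     {relative_gap:.2%}")
```

Both shapes land near 120M non-embedding parameters, so any accuracy difference between them would come from arrangement rather than count.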
Here's the twist worth knowing: the latent space isn't a reliable scratchpad for actual computation. When researchers checked whether models *execute* iterative numerical methods in latent space, they found that models instead pattern-match to memorized templates and emit plausible-but-wrong values, and crucially this failure persists across model scale (see "Do large language models actually perform iterative optimization?"). So you can scale latent capacity, but that doesn't automatically buy genuine latent-space reasoning. There's a difference between having more internal representational room and actually computing in it.
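For a sense of what "executing" an iterative method requires, consider a small example in which each step depends on the exact result of the previous one. The notes do not say which numerical methods were tested; Newton's iteration for a square root is used here purely as an assumed, familiar stand-in, and `newton_sqrt` is a hypothetical helper.

```python
def newton_sqrt(a: float, steps: int = 6) -> list[float]:
    """Return the successive Newton iterates for sqrt(a), starting from x = a."""
    x = a
    iterates = [x]
    for _ in range(steps):
        # Each update consumes the previous iterate; no step can be skipped.
        x = 0.5 * (x + a / x)
        iterates.append(x)
    return iterates

iterates = newton_sqrt(2.0)
errors = [abs(x - 2.0 ** 0.5) for x in iterates]
for step, (x, err) in enumerate(zip(iterates, errors)):
    print(f"step {step}: x = {x:.15f}, error = {err:.2e}")
```

A system that only recognizes "this looks like a square-root problem" and emits a familiar-looking number skips this chain of dependent updates, which is the gap between pattern-matching and computing that the finding describes.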
Where latent vectors clearly *do* earn their keep is as control knobs and memory. Conditioning a model on session- and turn-level latent variables yields simulated users whose realism can be measured (see "Can controlled latent variables make LLM user simulators realistic?"), and adaptive neural memory modules let models scale to 2M+ token contexts by compressing surprising tokens rather than storing everything in attention (see "Can neural memory modules scale language models beyond attention limits?"). The throughline across all of these: parameters store what the model knows, but latent structure governs what it can flexibly do with that knowledge, and the two scale on separate budgets, sometimes pulling in different directions.
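The "compress surprising tokens" idea can be sketched with a fixed-size memory that is only written to when its own prediction is wrong. This is a simplified stand-in, not the Titans mechanism: the linear memory, the use of absolute prediction error as the surprise signal, the hard threshold, and the name `stream_into_memory` are all assumptions made for illustration.

```python
def stream_into_memory(stream, dim, threshold=0.1, lr=0.5):
    """Feed (key, value) pairs through a fixed-size linear memory.

    The memory predicts each value from its key and is updated only when the
    prediction error (the "surprise") exceeds the threshold.
    """
    memory = [0.0] * dim  # size stays constant however long the stream is
    writes = 0
    for key, value in stream:
        predicted = sum(m * k for m, k in zip(memory, key))
        error = value - predicted
        if abs(error) > threshold:
            # Delta-rule update: move the prediction toward the observed value.
            memory = [m + lr * error * k for m, k in zip(memory, key)]
            writes += 1
    return memory, writes

# Three distinct associations, each repeated 20 times in the stream.
pairs = [
    ([1.0, 0.0, 0.0], 1.0),
    ([0.0, 1.0, 0.0], -2.0),
    ([0.0, 0.0, 1.0], 0.5),
]
stream = pairs * 20
memory, writes = stream_into_memory(stream, dim=3)
residuals = [abs(v - sum(m * k for m, k in zip(memory, key))) for key, v in pairs]

print(f"tokens seen: {len(stream)}, writes performed: {writes}")
print(f"largest remaining error: {max(residuals):.4f}")
```

Once an association is predicted well enough, repeats stop triggering writes, so storage cost follows how much of the stream is new rather than how long it is.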
Sources (7 notes)
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
RecLLM demonstrates that conditioning an LLM simulator on session-level (user profile) and turn-level (user intent) latent variables produces synthetic conversations whose realism can be measured via crowdsourced discrimination, discriminator models, and classifier-ensemble distribution matching.
The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.