Why does scaling data and model size improve compositional generalization?
This explores why neural networks get better at combining known pieces into novel wholes (compositional generalization) simply by adding more data and parameters — and what that scaling actually buys you versus what it can't.
The corpus has a genuinely contested answer to this, which is the interesting part. The cleanest pro-scaling result is that plain MLPs achieve compositional generalization with no architectural tricks at all, as long as the training distribution covers enough combinations of the underlying task modules. The telltale sign of success is that the individual constituents can be linearly decoded from the hidden activations (Can neural networks learn compositional skills without symbolic mechanisms?). So scaling 'works' partly because more data covers more of the combination space, and more capacity lets the model carve out clean, separable representations of each part.
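The linear-decodability check can be sketched on synthetic data. This is a minimal illustration, not the method of the cited note: the "hidden activations" are a hypothetical stand-in built by adding one direction per constituent plus noise, and every name (`emb_a`, `hidden`, `fit_probe`) is made up for the example. The probe is fit on some combinations and scored on held-out ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, dim = 4, 4, 32  # assumed: two constituents with 4 values each

# Hypothetical stand-in for a trained network's hidden layer:
# each constituent value contributes its own direction.
emb_a = rng.normal(size=(n_a, dim))
emb_b = rng.normal(size=(n_b, dim))

def hidden(a, b):
    """Synthetic activations for constituent index arrays a and b."""
    noise = 0.05 * rng.normal(size=(len(a), dim))
    return emb_a[a] + emb_b[b] + noise

# Enumerate all (a, b) combinations; hold out the diagonal as unseen ones.
pairs = np.array([(a, b) for a in range(n_a) for b in range(n_b)])
is_held_out = pairs[:, 0] == pairs[:, 1]
train_pairs, test_pairs = pairs[~is_held_out], pairs[is_held_out]

def sample(combos, n):
    idx = rng.integers(len(combos), size=n)
    a, b = combos[idx, 0], combos[idx, 1]
    return hidden(a, b), a, b

def fit_probe(H, labels, n_classes):
    """Least-squares linear map from activations to one-hot labels."""
    Y = np.eye(n_classes)[labels]
    W, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return W

def probe_accuracy(W, H, labels):
    return float(np.mean((H @ W).argmax(axis=1) == labels))

H_train, a_train, _ = sample(train_pairs, 600)
H_test, a_test, _ = sample(test_pairs, 200)

W_a = fit_probe(H_train, a_train, n_a)
acc_a = probe_accuracy(W_a, H_test, a_test)
print("probe accuracy for constituent a on held-out combinations:", acc_a)
```

In this toy setup the probe transfers to unseen combinations because the constituents occupy separable directions by construction. In a real network that separability is the thing being measured, not something to assume.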
The deeper mechanism shows up in how networks organize themselves internally. Pruning studies find that networks naturally split compositional tasks into isolated modular subnetworks: ablate one and only its matching function breaks. Crucially, pretraining (i.e., more data) makes this modular structure far more consistent and reliable (Do neural networks naturally learn modular compositional structure?). That reframes scaling: it isn't teaching a symbolic rule, it's making the reusable parts cleaner and more reliably separated. The same theme appears in architecture choices. For small models (125M–350M parameters), depth beats width because composing abstract concepts across layers matters more than spreading parameters sideways (Does depth matter more than width for tiny language models?). Composition is something the network builds up in stages, and giving it more stages or more reliable parts helps.
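The ablation test behind that claim can be sketched as follows. This is a hand-wired toy layer, not a trained or pruned network, so the modularity is built in by assumption. It shows only the shape of the check: zero out one group of hidden units and compare how much each output function changes. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))

# Hand-wired toy (assumption): hidden units 0-3 feed only output f,
# hidden units 4-7 feed only output g.
W_in = rng.normal(size=(2, 8))
W_out = np.zeros((8, 2))
W_out[:4, 0] = rng.normal(size=4)  # f reads units 0-3
W_out[4:, 1] = rng.normal(size=4)  # g reads units 4-7

def forward(inputs, mask):
    """One ReLU hidden layer; mask zeroes out ablated hidden units."""
    h = np.maximum(inputs @ W_in, 0.0) * mask
    return h @ W_out

keep_all = np.ones(8)
ablate_f_module = np.ones(8)
ablate_f_module[:4] = 0.0  # knock out the subnetwork serving f

full = forward(x, keep_all)
ablated = forward(x, ablate_f_module)

change_f = float(np.abs(full[:, 0] - ablated[:, 0]).mean())
change_g = float(np.abs(full[:, 1] - ablated[:, 1]).mean())
print("mean change in f:", change_f)
print("mean change in g:", change_g)
```

A modular network in the sense described above would show the same pattern after training: a large change in the function served by the ablated subnetwork and little or none in the others.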
But here is the catch: several notes argue that scaling produces the *appearance* of composition without the real thing. One line of work shows transformers succeed in-distribution by memorizing computation subgraphs from training, then fail drastically on genuinely novel compositions, with errors compounding across reasoning steps (Do transformers actually learn systematic compositional reasoning?). The 'binding problem' framing explains why: networks struggle to dynamically bind distributed pieces into new structures, and while scaling can *partially* paper over this by letting compositional representations emerge, it doesn't dissolve the underlying limit (Why do neural networks fail at compositional generalization?). There is also a hard ceiling result: on genuine constrained-optimization tasks, models plateau at around 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, suggesting some compositional gaps are structural, not scaling gaps (Do larger language models solve constrained optimization better?).
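A back-of-the-envelope sketch shows why compounding errors hurt so quickly. It assumes each reasoning step succeeds independently with the same probability, which is a simplification introduced here and not a claim from the cited notes. The 95% per-step figure is likewise an arbitrary illustration.

```python
def chain_accuracy(step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in an n-step chain is correct,
    assuming independent steps with equal per-step accuracy."""
    return step_accuracy ** n_steps

for n_steps in (1, 5, 10, 20):
    print(n_steps, round(chain_accuracy(0.95, n_steps), 3))
```

Under these assumptions a model that is right 95% of the time per step gets a ten-step chain fully right about 60% of the time, and a twenty-step chain about 36% of the time.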
Put together, the corpus suggests a sharp reading: scaling improves compositional generalization mainly when the task space is *coverable*, that is, when more data fills in the combinations and more capacity sharpens the separable, reusable parts. It does less when novel composition requires binding pieces in ways never seen, where you hit the subgraph-memorization wall. That's also why the collection is full of alternatives to brute scale: composing expert vectors at inference instead of in the weights (Can models dynamically activate expert skills at inference time?), decomposing a skill like function-calling into explicitly trained subtasks so generalization is taught rather than hoped for (Can breaking function calling into subtasks improve model generalization?), and adding entirely new scaling dimensions beyond parameters via latent thought vectors (Can latent thought vectors scale language models beyond parameters?). If you want to go deeper, the tension between 'emerges from scaling' and 'reduces to memorization' is the live debate worth following.
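The first of those alternatives, mixing expert vectors at inference, can be sketched with plain NumPy. This is a rough illustration of the singular-value idea summarized in the sources, not the cited system's implementation: the weight matrix, the two expert vectors, and the mixing weights are all hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))  # hypothetical frozen weight matrix

# Decompose once: W == U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Hypothetical expert vectors: one scale factor per singular value.
z_math = np.array([1.2, 1.0, 0.8, 1.0])
z_code = np.array([1.0, 0.7, 1.0, 1.3])

def adapt(z):
    """Rebuild the weight matrix with singular values rescaled by z."""
    return U @ np.diag(s * z) @ Vt

def mix(weights, experts):
    """Weighted combination of expert vectors, chosen at inference time."""
    return sum(w * z for w, z in zip(weights, experts))

z_mixed = mix([0.5, 0.5], [z_math, z_code])
W_mixed = adapt(z_mixed)
print(W_mixed.shape)
```

Each expert here costs only as many parameters as the matrix has singular values, and because `adapt` is linear in `z`, mixing expert vectors is equivalent to mixing the adapted matrices.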
Sources (9 notes)
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.
Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.