INQUIRING LINE

Does making an AI bigger teach it to combine ideas it knows into new arrangements — or just memorize more examples?

Why does scaling data and model size improve compositional generalization?

This explores why neural networks get better at combining known pieces into novel wholes (compositional generalization) simply by adding more data and parameters — and what that scaling actually buys you versus what it can't.

The corpus gives a genuinely contested answer, which is the interesting part. The cleanest pro-scaling result is that plain MLPs achieve compositional generalization with no architectural tricks at all, as long as the training distribution covers enough combinations of the underlying task modules; the telltale sign of success is that the individual constituents can be linearly decoded from the hidden activations (Can neural networks learn compositional skills without symbolic mechanisms?). So scaling 'works' partly because more data covers more of the combination space, and partly because more capacity lets the model carve out clean, separable representations of each part.
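The coverage condition can be made concrete with a held-out-combination split. The sketch below is illustrative only and is not the protocol used in the cited note: the function name `compositional_split` and the color/shape slots are hypothetical. It holds out whole combinations for testing while keeping every individual constituent present in training, so a test pair is novel as a composition but never as a part.

```python
import itertools
import random


def compositional_split(slot_a, slot_b, holdout_fraction=0.25, seed=0):
    """Split all (a, b) pairs into train/test sets.

    Test pairs never appear in training as a combination, but every
    individual constituent still appears in at least one training pair.
    """
    rng = random.Random(seed)
    pairs = list(itertools.product(slot_a, slot_b))
    rng.shuffle(pairs)
    target = int(len(pairs) * holdout_fraction)

    train = set(pairs)
    test = set()
    for a, b in pairs:
        if len(test) >= target:
            break
        remaining = train - {(a, b)}
        # Only hold the pair out if both of its parts stay covered in training.
        a_covered = any(x == a for x, _ in remaining)
        b_covered = any(y == b for _, y in remaining)
        if a_covered and b_covered:
            train = remaining
            test.add((a, b))
    return sorted(train), sorted(test)


# Hypothetical task modules: 4 colors x 4 shapes = 16 possible combinations.
colors = ["red", "green", "blue", "gray"]
shapes = ["cube", "ball", "cone", "ring"]
train_pairs, test_pairs = compositional_split(colors, shapes)
coverage = len(train_pairs) / (len(train_pairs) + len(test_pairs))
print(f"training covers {coverage:.0%} of the combination space")
```

Raising `holdout_fraction` shrinks coverage, which is the knob the coverage argument says should degrade generalization to the held-out pairs.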

The deeper mechanism shows up in how networks organize themselves internally. Pruning studies find that networks naturally split compositional tasks into isolated modular subnetworks: ablate one and only its matching function breaks. Crucially, pretraining (i.e., more data) makes this modular structure far more consistent and reliable (Do neural networks naturally learn modular compositional structure?). That reframes scaling: it isn't teaching a symbolic rule, it's making the reusable parts cleaner and more reliably separated. The same theme appears in architecture choices. For tiny models, depth beats width because composing abstract concepts across layers matters more than spreading parameters sideways (Does depth matter more than width for tiny language models?). Composition is something the network builds up in stages, and giving it more stages or more reliable parts helps.
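What "ablate one and only its matching function breaks" means operationally can be shown with a deliberately hand-wired toy. This is an assumption-laden illustration, not the pruning method from the cited note: the three hidden units, the two tasks (`double`, `add_three`), and the helper names are all hypothetical, and a real study would discover the subnetworks rather than wire them in.

```python
def forward(x, ablated=frozenset()):
    """Toy 'network' with three hidden units and two task outputs.

    Unit 0 serves the 'double' task; units 1 and 2 serve 'add_three'.
    Ablating a unit zeroes its activation.
    """
    hidden = [2.0 * x, x, 3.0]
    hidden = [0.0 if i in ablated else h for i, h in enumerate(hidden)]
    return {"double": hidden[0], "add_three": hidden[1] + hidden[2]}


def task_errors(ablated=frozenset(), inputs=(1.0, 2.0, 5.0)):
    """Mean absolute error per task after ablating the given hidden units."""
    targets = {"double": lambda x: 2.0 * x, "add_three": lambda x: x + 3.0}
    errors = {}
    for name, target_fn in targets.items():
        total = sum(abs(forward(x, ablated)[name] - target_fn(x)) for x in inputs)
        errors[name] = total / len(inputs)
    return errors


print("intact:          ", task_errors())
print("ablate unit 0:   ", task_errors(frozenset({0})))
print("ablate units 1,2:", task_errors(frozenset({1, 2})))
```

The modularity signature is the pattern of errors: each ablation raises the error of exactly one task and leaves the other at zero.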

But several notes argue that scaling produces the *appearance* of composition without the real thing. One line of work shows transformers succeed in-distribution by memorizing computation subgraphs from training, then fail drastically on genuinely novel compositions, with errors compounding across reasoning steps (Do transformers actually learn systematic compositional reasoning?). The 'binding problem' framing explains why: networks struggle to dynamically bind distributed pieces into new structures, and while scaling can *partially* paper over this by letting compositional representations emerge, it doesn't dissolve the underlying limit (Why do neural networks fail at compositional generalization?). And there's a hard ceiling result: on genuine constrained-optimization tasks, models plateau around 55–60% constraint satisfaction regardless of architecture, parameter count, or training regime, suggesting some compositional gaps are structural, not scaling gaps (Do larger language models solve constrained optimization better?).
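Why compounding errors are so damaging on long compositions can be seen with a back-of-the-envelope model. The sketch assumes each reasoning step succeeds independently with the same probability, which is a simplifying assumption of this example and not a claim from the cited note; the function name and the 0.95 per-step accuracy are hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes steps fail independently with identical accuracy (a toy model).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


# A model that is 95% reliable per step degrades quickly as chains lengthen.
for steps in (1, 5, 10, 20):
    print(steps, round(chain_success_probability(0.95, steps), 3))
```

Under these assumptions, end-to-end success decays geometrically with the number of composed steps, so even high per-step accuracy leaves deep compositions unreliable.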

Put together, the corpus suggests a sharp reading: scaling improves compositional generalization mainly when the task space is *coverable*, that is, when more data fills in the combinations and more capacity sharpens the separable, reusable parts. It does less when novel composition requires binding pieces in ways never seen, where models hit the subgraph-memorization wall. That's also why the collection is full of alternatives to brute scale: composing expert vectors at inference instead of in the weights (Can models dynamically activate expert skills at inference time?), decomposing a skill like function-calling into explicitly trained subtasks so generalization is taught rather than hoped for (Can breaking function calling into subtasks improve model generalization?), and adding entirely new scaling dimensions beyond parameters via latent thought vectors (Can latent thought vectors scale language models beyond parameters?). If you want to go deeper, the tension between 'emerges from scaling' and 'reduces to memorization' is the live debate worth following.

Sources (9 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Scaling can lead to compositional generalization (3.59 match)
Break It Down: Evidence for Structural Compositionality in Neural Networks (3.56 match)
Bigger is not always better: The importance of human-scale language modeling for psycholinguistics (3.31 match)
How do Transformers Learn Implicit Reasoning? (3.30 match)
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks (2.60 match)
Faith and Fate: Limits of Transformers on Compositionality (2.58 match)
Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs (1.68 match)
Scaling Laws for Neural Language Models (1.68 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking compositional generalization in LLMs. The question remains open: *under what conditions does scaling data and model size actually enable models to recombine known elements into unseen compositions, and when does it merely memorize computation patterns?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as perishable constraints to re-test:
• Scaling works when the task space is *coverable*: more data fills combination gaps, more capacity sharpens separable, linearly-readable constituent representations (2024).
• Networks naturally organize into isolated modular subnetworks for compositional tasks; pretraining makes this modular structure far more consistent and reliable (2023).
• Transformers succeed in-distribution by memorizing subgraph computations from training, then fail drastically on genuinely novel compositions with cascading errors (2023).
• Hard ceiling: models plateau at 55–60% on genuine constrained-optimization tasks regardless of parameter count, suggesting structural compositional limits (2026).
• Alternatives bypass brute scaling: expert-vector composition at inference, explicit multi-task decomposition of skills, and latent-thought scaling dimensions (2025).

Anchor papers (verify; mind their dates):
• arXiv:2012.05208 (2020) — binding problem framing
• arXiv:2305.18654 (2023) — memorization vs. true composition
• arXiv:2501.06252 (2025) — self-adaptive expert composition
• arXiv:2603.23004 (2026) — constraint-satisfaction plateau

Your task:
(1) RE-TEST THE COVERAGE/MODULARITY THESIS. Does it still hold that scaling primarily works by filling combination-space coverage? Probe whether newer models (Gemini 2.5, o1 variants, any 2025–2026 releases) show qualitative compositional leaps beyond memorization, or whether they too hit the subgraph wall. Separate the durable finding ('more data→cleaner modules') from any perishable constraint ('scaling is necessary').
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Specifically hunt for papers that claim scaling *dissolves* the binding problem, or papers showing in-context composition without parameter scaling, or evidence that latent-thought or expert-mixture methods actually outperform naive scaling on held-out novel compositions.
(3) Propose 2 research questions assuming the regime has moved: one assuming compositional generalization is now *solved* by new training methods or architectures (what's the new frontier?), one assuming the 55–60% plateau is real (what structural bottleneck actually limits binding, and can you design around it?).

Cite arXiv IDs; flag anything you cannot ground in a real paper.
