INQUIRING LINE

Does scaling model size solve compositional generalization problems?

This line explores whether simply making models bigger fixes their trouble with compositional generalization, meaning combining known pieces in new ways. The corpus answer is conditional: it depends on what's being scaled and on what kind of novelty you throw at it.


This question asks whether throwing more parameters at a model dissolves its failure to recombine known pieces into novel wholes. The corpus splits sharply, and the split is the interesting part: scaling helps when the training distribution already covers the combinations you'll test, and stalls when it doesn't.

The optimistic case is real but narrow. One line of work shows that plain neural networks achieve compositional generalization through data and model scaling alone, with no symbolic machinery, provided the training data sufficiently covers the combinations of task modules (Can neural networks learn compositional skills without symbolic mechanisms?). There is even a mechanistic reason this can work: networks naturally carve compositional tasks into isolated modular subnetworks, and pretraining makes that modular structure more consistent (Do neural networks naturally learn modular compositional structure?). The classic 'binding problem' analysis agrees that scale can *partially* overcome systematic-generalization failure by letting compositional representations emerge (Why do neural networks fail at compositional generalization?). So scaling isn't useless here: it's a way of buying coverage.
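To make "coverage of the combinations of task modules" concrete, here is a minimal sketch of a compositional train/held-out split. It assumes a task is a pair of modules and that a fair test holds out whole pairs while keeping every individual module visible in training. The function names, the pair structure, and the module labels are hypothetical illustrations, not the setup of any cited paper.

```python
from itertools import product
import random


def compositional_split(modules_a, modules_b, holdout_fraction=0.25, seed=0):
    """Split all (a, b) module pairs into training pairs and held-out pairs.

    Held-out pairs are novel combinations, but each of their modules still
    appears in at least one training pair (an assumption of this sketch).
    """
    rng = random.Random(seed)
    pairs = list(product(modules_a, modules_b))
    rng.shuffle(pairs)
    n_holdout = int(len(pairs) * holdout_fraction)

    train = list(pairs)
    heldout = []
    for pair in pairs:
        if len(heldout) >= n_holdout:
            break
        remaining = [p for p in train if p != pair]
        # Hold the pair out only if both of its modules stay covered in training.
        a_covered = any(p[0] == pair[0] for p in remaining)
        b_covered = any(p[1] == pair[1] for p in remaining)
        if a_covered and b_covered:
            train = remaining
            heldout.append(pair)
    return train, heldout


def coverage(train, modules_a, modules_b):
    """Fraction of all possible module combinations present in training."""
    return len(set(train)) / (len(modules_a) * len(modules_b))


if __name__ == "__main__":
    a = ["reverse", "sort", "dedupe", "shift"]  # hypothetical module names
    b = ["upper", "lower", "double"]
    train, heldout = compositional_split(a, b)
    print("coverage:", coverage(train, a, b))
    print("held-out combinations:", heldout)
```

Raising `holdout_fraction` lowers coverage while leaving every module individually seen, which is the regime where the optimistic and pessimistic findings would be expected to diverge.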

But 'buying coverage' is exactly the trap. A second cluster of papers shows that what looks like composition is often memorization of training-time computation subgraphs: transformers succeed in-distribution by matching linearized subgraphs and then fail drastically on genuinely novel compositions, with errors compounding across steps (Do transformers actually learn systematic compositional reasoning?). The sharpest reframing comes from work showing that reasoning breaks at *instance novelty*, not task complexity: models fit instance-based patterns rather than general algorithms, so a long reasoning chain succeeds if something similar was in training and collapses if not (Do language models fail at reasoning due to complexity or novelty?). If the bottleneck is novelty rather than model size, more parameters don't address it.
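A back-of-the-envelope sketch of what "errors compounding across steps" implies. It assumes every step succeeds independently with the same probability, a simplification introduced here for illustration and not a model taken from the cited work. The function name is hypothetical.

```python
def chain_success_probability(step_accuracy: float, n_steps: int) -> float:
    """Probability that every step of an n-step chain is correct.

    Assumes independent steps with identical accuracy (a simplifying
    assumption of this sketch, not a claim from the papers).
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return step_accuracy ** n_steps


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(n, round(chain_success_probability(0.95, n), 3))
```

Under this assumption, 95% per-step accuracy drops to roughly 36% over a 20-step chain, which is why small per-step weaknesses on novel compositions can look like a drastic failure end to end.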

The hardest evidence against the scaling cure is where scale visibly flatlines. On genuine constrained-optimization tasks, LLMs plateau at roughly 55–60% constraint satisfaction regardless of parameter count, architecture, or training regime, which points to a ceiling rather than a gap a bigger model would close (Do larger language models solve constrained optimization better?). Relatedly, models can't actually execute iterative numerical procedures; they recognize a problem as template-similar and emit plausible wrong answers, and this persists across scale (Do large language models actually perform iterative optimization?). And even with the right information in front of them, strong parametric priors override in-context evidence, a failure that prompting alone can't fix (Why do language models ignore information in their context?).

The lateral surprise here: several papers suggest the productive move isn't *bigger* but *differently shaped*. Depth beats width for small models because composing abstract concepts through layers matters more than spreading parameters across width (Does depth matter more than width for tiny language models?). Latent-thought models open scaling dimensions that have nothing to do with parameter count (Can latent thought vectors scale language models beyond parameters?), and small models trained on explicit negative examples can match large ones on structured tasks (Can small models match large models on function calling?). The collective verdict: scaling moves the boundary of what's been *seen*, but compositional generalization is precisely the demand to go *beyond* what's been seen, so scale postpones the wall rather than removing it.
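The depth-versus-width point is a claim about shape at a fixed budget, which a rough parameter count can illustrate. The sketch below uses the common approximation of about 12 x layers x d_model^2 non-embedding parameters for a standard transformer stack, assuming a feed-forward width of 4 x d_model and ignoring biases, norms, and embeddings. The two configurations are hypothetical and are not the MobileLLM architectures.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count for a standard transformer stack.

    Per layer: about 4*d^2 for attention projections plus 8*d^2 for an MLP
    with hidden width 4*d. Biases, norms, and embeddings are ignored.
    """
    return 12 * n_layers * d_model ** 2


# Two hypothetical shapes with the same approximate budget.
wide_shallow = approx_block_params(n_layers=12, d_model=1024)
deep_thin = approx_block_params(n_layers=48, d_model=512)

if __name__ == "__main__":
    print("wide and shallow:", wide_shallow)
    print("deep and thin:  ", deep_thin)
```

Halving the width frees enough budget for four times as many layers, so a depth-versus-width comparison at small scale is a comparison between equally sized models with different numbers of sequential composition steps.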


Sources (11 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances and fails otherwise.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking compositional generalization in LLMs. The question remains open: does scaling model size solve compositional generalization, or does it merely postpone failure?

What a curated library found — and when (dated claims, not current truth):
These findings span 2020–2026, with the sharpest tensions emerging 2023–2025:
• Scaling helps *only* when training data covers the target combinations; on truly novel compositions, larger models still fail, collapsing on instance-level unfamiliarity rather than task complexity (~2024).
• Transformers succeed by pattern-matching linearized subgraphs seen in training, not by executing compositional algorithms; separately, on constrained-optimization tasks LLMs plateau at ~55–60% constraint satisfaction regardless of scale (~2024–2025).
• Depth-over-width and latent-thought architectures open scaling dimensions orthogonal to parameter count; small models on negative-example training match large ones on structured tasks (~2024–2025).
• Recent work (2025–2026) shows scale *can* push compositional generalization forward under specific conditions, but the mechanism remains tethered to training-distribution coverage, not abstract reasoning capacity.

Anchor papers (verify; mind their dates):
- arXiv:2305.18654 (2023-05): Faith and Fate — transformers' compositional limits.
- arXiv:2410.18890 (2024-10): Small models match large on function-calling reasoning via training design.
- arXiv:2507.07207 (2025-07): Scaling can lead to compositional generalization — conditional claims.
- arXiv:2603.23004 (2026-03): LLM reasoning failures under constraints.

Your task:
(1) **RE-TEST EACH CEILING.** For the 55–60% constraint-satisfaction plateau, instance-novelty collapse, and subgraph-matching bottleneck: has recent tooling (test-time compute, chain-of-thought scaffolding, retrieval augmentation, or mixture-of-experts orchestration) actually *relaxed* these, or merely masked them? Distinguish durable limits (compositional reasoning requires unseen concept combinations) from perishable ones (current architectures don't expose this capability).
(2) **SURFACE STRONGEST CONTRADICTION.** The library shows sharp disagreement between arXiv:2507.07207 (scaling *does* help) and arXiv:2305.18654 + arXiv:2602.06176 (it doesn't). Hunt the last 6 months for work that explains *when* each is true — or evidence one view has been empirically overturned.
(3) **PROPOSE 2 FORWARD QUESTIONS:** (a) If compositionality emerges only at training-data boundaries, what regime would demonstrate *algorithmic* composition independent of coverage? (b) Do latent-thought models or depth-optimized architectures actually escape the novelty bottleneck, or do they shift it invisibly?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
