INQUIRING LINE

Does compositional generalization emerge suddenly or improve smoothly with scale?

This explores whether the ability to combine known pieces into novel wholes appears as an abrupt jump at some scale threshold, or whether it climbs gradually as you add data and parameters — and what actually drives the transition.


The corpus leans toward smooth-but-conditional. The strongest direct evidence comes from work showing that plain MLPs reach compositional generalization through data and model scaling alone, with no special architecture required, *provided* the training distribution covers enough combinations of the underlying task modules (Can neural networks learn compositional skills without symbolic mechanisms?). The interesting wrinkle is a smooth, measurable signal underneath the behavior: how linearly decodable the individual constituents are from the hidden activations reliably predicts success. So what looks like a capability switching on is better read as a representation gradually becoming cleaner until it crosses a usefulness threshold.
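To make "linearly decodable constituents" concrete, here is a minimal sketch of a linear probe, assuming you already have a matrix of hidden activations and an integer label for one constituent per example. The function names, the least-squares fit, and the synthetic activations are illustrative assumptions, not the method used in the cited work.

```python
import numpy as np


def linear_probe_accuracy(hidden, labels, train_frac=0.8, seed=0):
    """Fit a least-squares linear probe and return held-out accuracy.

    hidden: (n_samples, d) array of activations.
    labels: (n_samples,) integer constituent ids in [0, n_classes).
    """
    rng = np.random.default_rng(seed)
    n = hidden.shape[0]
    order = rng.permutation(n)
    split = int(train_frac * n)
    train, test = order[:split], order[split:]

    n_classes = int(labels.max()) + 1
    targets = np.eye(n_classes)[labels]            # one-hot targets
    feats = np.hstack([hidden, np.ones((n, 1))])   # append a bias column

    # Linear map from activations to one-hot labels, fit on the train split only.
    weights, *_ = np.linalg.lstsq(feats[train], targets[train], rcond=None)
    preds = (feats[test] @ weights).argmax(axis=1)
    return float((preds == labels[test]).mean())


def synthetic_activations(n=2000, d=32, n_classes=4, noise=0.1, seed=0):
    """Hypothetical stand-in for real hidden states: one direction per class plus noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n)
    directions = rng.normal(size=(n_classes, d))
    hidden = directions[labels] + noise * rng.normal(size=(n, d))
    return hidden, labels


if __name__ == "__main__":
    for noise in (0.1, 5.0, 50.0):
        hidden, labels = synthetic_activations(noise=noise)
        print(noise, linear_probe_accuracy(hidden, labels))
```

In this toy setup the probe accuracy falls as the noise level rises, which is the kind of graded signal the paragraph describes. With a real model you would replace `synthetic_activations` with activations captured at a chosen layer.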

But "smooth with scale" hides a sharp cliff at the distribution edge. Several notes argue that what scaling buys is often sophisticated memorization rather than genuine rule-learning. Transformers frequently solve compositional tasks by matching linearized computation subgraphs seen in training, then fail drastically on genuinely novel compositions, with errors compounding step by step (Do transformers actually learn systematic compositional reasoning?). Chain-of-thought shows the same shape: fluent and reliable inside the training distribution, then degrading predictably once task, length, or format shifts (Does chain-of-thought reasoning actually generalize beyond training data?). The takeaway is that smoothness is real *within* the covered space, and the apparent "sudden" failures mark the point where inputs step past the edge of what the data covered.
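One way to separate "inside the covered space" from "past the edge" in an evaluation is to hold out whole combinations while keeping every individual piece visible in training. The sketch below assumes a task built from two hypothetical module families; `split_by_composition` and the module names are illustrative and do not come from the cited papers.

```python
import itertools
import random


def split_by_composition(modules_a, modules_b, holdout_frac=0.25, seed=0):
    """Split (a, b) pairs so test pairs are novel combinations of seen primitives.

    Returns (train_pairs, test_pairs, coverage), where coverage is the
    fraction of all possible combinations that remain in training.
    """
    rng = random.Random(seed)
    pairs = list(itertools.product(modules_a, modules_b))
    rng.shuffle(pairs)
    n_holdout = int(holdout_frac * len(pairs))

    train, test = list(pairs), []
    for pair in pairs:
        if len(test) >= n_holdout:
            break
        a, b = pair
        remaining = [p for p in train if p != pair]
        # Hold a pair out only if both of its primitives still appear in training.
        if any(p[0] == a for p in remaining) and any(p[1] == b for p in remaining):
            train = remaining
            test.append(pair)

    coverage = len(train) / len(pairs)
    return train, test, coverage


if __name__ == "__main__":
    shapes = ["rotate", "flip", "shift", "scale"]      # hypothetical module family A
    colors = ["red", "green", "blue", "gray"]          # hypothetical module family B
    train, test, coverage = split_by_composition(shapes, colors)
    print("coverage:", coverage)
    print("held-out combinations:", test)
```

Sweeping `holdout_frac` and plotting test accuracy against `coverage` is one way to check whether performance degrades gradually or drops at a threshold. Accuracy on `test` then measures composition, because no primitive is new.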

There's also a structural camp that says scale only papers over a deeper problem. The binding-problem framing argues that neural nets struggle to dynamically bind distributed information into reusable compositional structures, and that scaling only *partially* overcomes this: it lets compositional representations emerge without solving the underlying failures of segregation, separation, and reuse (Why do neural networks fail at compositional generalization?). Complementing that, pruning studies show that networks spontaneously carve compositional tasks into isolated modular subnetworks and, crucially, that *pretraining* makes this modular structure far more consistent and reliable across architectures (Do neural networks naturally learn modular compositional structure?). So part of what scale (especially pretraining) contributes is not a magic jump but a steady increase in how cleanly reusable parts get separated.
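The ablation logic behind the modularity claim can be sketched on a toy network. The example below is hand-built so that two subroutines live in disjoint hidden units; it illustrates the measurement (zero a group of units, see which outputs change), not the pruning procedure from the cited study. All names and weights are assumptions made for illustration.

```python
import numpy as np


def forward(x, w_in, w_out, mask=None):
    """One-hidden-layer ReLU network; mask zeroes selected hidden units."""
    hidden = np.maximum(x @ w_in, 0.0)
    if mask is not None:
        hidden = hidden * mask
    return hidden @ w_out


def ablation_effects(x, w_in, w_out, groups):
    """Mean absolute change of each output when each hidden-unit group is zeroed."""
    base = forward(x, w_in, w_out)
    effects = {}
    for name, units in groups.items():
        mask = np.ones(w_in.shape[1])
        mask[units] = 0.0
        ablated = forward(x, w_in, w_out, mask)
        effects[name] = np.abs(base - ablated).mean(axis=0)
    return effects


def toy_modular_net():
    """Hand-built weights: output 0 computes |x0|, output 1 computes |x1|."""
    w_in = np.array([[1.0, -1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, -1.0]])
    w_out = np.array([[1.0, 0.0],
                      [1.0, 0.0],
                      [0.0, 1.0],
                      [0.0, 1.0]])
    groups = {"abs_x0": [0, 1], "abs_x1": [2, 3]}
    return w_in, w_out, groups


if __name__ == "__main__":
    w_in, w_out, groups = toy_modular_net()
    x = np.random.default_rng(0).normal(size=(200, 2))
    for name, effect in ablation_effects(x, w_in, w_out, groups).items():
        print(name, effect)
```

In a perfectly modular network, each group's effect vector is nonzero only for its own output. For a trained network the interesting quantity is how close the measured effects come to that pattern, and whether pretraining moves them closer.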

The reuse angle is where the corpus answers the "sudden vs. smooth" question most concretely. Length generalization, a close cousin of compositional generalization, transfers across related tasks because models reuse the *same* attention heads, and pretrained models already carry this computational scaffolding before they ever see the target task (Can length generalization transfer between different related tasks?). That reframes apparent emergence as existing scaffolding being activated rather than built from scratch. The same theme shows up in architecture: deep-and-thin small models beat wide ones because depth lets them *compose* abstract concepts through layers rather than spreading them across width (Does depth matter more than width for tiny language models?). This is evidence that compositional ability tracks a specific structural resource (depth), not raw parameter count alone.

The synthesis worth taking away: "sudden vs. smooth" is partly a measurement artifact. Behaviorally, compositional generalization can *look* abrupt because it depends on a coverage threshold and on reusable circuits switching on. Mechanistically, the corpus suggests it improves smoothly — decodable constituents get cleaner, modular subnetworks get more reliable, attention-head scaffolding gets reused — right up until you hit the boundary of what training covered, where it falls off a cliff rather than degrading gracefully. Scale helps mostly by widening that covered region and sharpening reusable structure, not by conjuring systematic rules out of nowhere.


Sources (7 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can length generalization transfer between different related tasks?

Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic AI researcher re-testing claims about compositional generalization at scale. The question remains: does it emerge suddenly or improve smoothly?

What a curated library found — and when (dated claims, not current truth):
Findings span 2016–2025. The library identified five threads:
- Plain MLPs reach compositional generalization via data+model scaling alone, *if* training distribution covers enough module combinations; internally, linearly decodable constituents predict success (2023–2025).
- Transformers often solve compositionality via linearized subgraph matching within the training distribution, then fail sharply on novel compositions; chain-of-thought shows identical distribution-bounded degradation (2023–2025).
- The binding problem suggests scaling only *partially* solves compositional segregation; pretraining reliably carves modular subnetworks across architectures (2020–2023).
- Length generalization transfers via reused attention heads; pretrained models carry this scaffolding before target-task exposure (2025).
- Depth (not width alone) enables compositional abstraction; sub-billion-parameter models show depth-dependent gains (2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2305.18654 (2023): Faith and Fate — Transformers' compositionality limits.
- arXiv:2012.05208 (2020): The Binding Problem in neural networks.
- arXiv:2506.09251 (2025): Extrapolation by Association — length generalization transfer.
- arXiv:2508.01191 (2025): Chain-of-Thought distribution-boundedness.

Your task:
(1) RE-TEST: For each constraint (MLPs need coverage; CoT degrades off-distribution; modular subnetworks require pretraining; depth beats width), check whether 2025–2026 developments (mixture-of-experts architectures, mechanistic interpretability tools, new training regimes, test-time scaling like Titans 2501.00663, or new evaluation harnesses) have *relaxed* or *overturned* it. Separate the durable question ("what is the mechanistic unit of composition?") from perishable limits ("pretraining is necessary"). Cite what resolved it.
(2) Surface the strongest work from the last 6 months that *contradicts* the "smooth-within-coverage" narrative—especially anything showing true compositional rule-learning or abrupt phase transitions.
(3) Propose two research questions assuming the regime has moved: e.g., "Does test-time scaling (Titans) buy compositional generalization outside training coverage?" and "Can mixture-of-experts avoid the binding problem via dynamic routing?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
