INQUIRING LINE

Can scaling alone create compositional generalization without explicit binding mechanisms?

This explores whether neural networks can learn to combine known pieces into novel wholes just by getting bigger and seeing more data — or whether they need special built-in machinery for 'binding' parts together.


The corpus splits sharply on this question, and the disagreement is the interesting part. On the optimistic side, plain MLPs achieve compositional generalization with no architectural tricks, as long as the training data covers enough combinations of the underlying task pieces, and success can be predicted by checking whether the constituent parts are linearly readable from hidden activations (see "Can neural networks learn compositional skills without symbolic mechanisms?"). That finding is reinforced by evidence that networks spontaneously carve compositional tasks into isolated modular subnetworks, where ablating one piece disturbs only its own function, and that pretraining makes this modular structure more reliable (see "Do neural networks naturally learn modular compositional structure?"). So binding-like structure seems to emerge without being built in.
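To make "linearly readable from hidden activations" concrete, here is a minimal sketch of a linear probe on synthetic data. Everything in it is an assumption for illustration: the activations are generated by hand as a linear mix of two binary constituents `a` and `b` plus noise, the directions `u` and `v` are hypothetical, and a simple perceptron stands in for whatever probe the cited work used. The product `a * b` is included only as a contrast, since it is not linearly present in these activations.

```python
import random

def make_dataset(n, rng):
    """Synthetic 'hidden activations': a linear mix of two binary constituents plus noise."""
    u = (1.0, 0.0, 1.0, 0.0)   # hypothetical direction carrying constituent a
    v = (0.0, 1.0, 0.0, -1.0)  # hypothetical direction carrying constituent b
    rows = []
    for _ in range(n):
        a, b = rng.choice((-1, 1)), rng.choice((-1, 1))
        h = tuple(a * ui + b * vi + rng.uniform(-0.1, 0.1) for ui, vi in zip(u, v))
        rows.append((h, a, b))
    return rows

def score(w, h):
    """Linear readout: dot product of probe weights and one activation vector."""
    return sum(wi * hi for wi, hi in zip(w, h))

def train_probe(examples, epochs=50):
    """Perceptron with no hidden layer, so it can only pick up linear structure."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        mistakes = 0
        for h, y in examples:
            pred = 1 if score(w, h) > 0 else -1
            if pred != y:
                w = [wi + y * hi for wi, hi in zip(w, h)]
                mistakes += 1
        if mistakes == 0:  # every training example is read out correctly
            break
    return w

def probe_accuracy(train, test, label):
    """Fit the probe on train activations, report held-out accuracy for one label."""
    w = train_probe([(h, label(a, b)) for h, a, b in train])
    hits = sum((1 if score(w, h) > 0 else -1) == label(a, b) for h, a, b in test)
    return hits / len(test)

rng = random.Random(0)
train, test = make_dataset(200, rng), make_dataset(200, rng)
acc_a = probe_accuracy(train, test, lambda a, b: a)        # constituent a
acc_b = probe_accuracy(train, test, lambda a, b: b)        # constituent b
acc_xor = probe_accuracy(train, test, lambda a, b: a * b)  # not linearly present
print(f"a: {acc_a:.2f}  b: {acc_b:.2f}  a*b: {acc_xor:.2f}")
```

In this toy setup the probe should recover `a` and `b` almost perfectly and stay near chance on `a * b`. On a real network, the activations would come from a chosen hidden layer and the labels from the task's known constituents.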

But the pessimistic camp says that what looks like composition is often memorization in disguise. Transformers frequently 'succeed' by matching linearized subgraphs of computation they saw in training, then fail badly on genuinely novel combinations, with errors compounding step by step (see "Do transformers actually learn systematic compositional reasoning?"). The classic framing for why this happens is the binding problem: networks struggle to dynamically bind distributed information into reusable structure. They cannot cleanly segregate entities, keep their representations separate, and recombine them in new ways (see "Why do neural networks fail at compositional generalization?"). Notably, that same note concedes that scaling can *partially* overcome the problem by letting compositional representations emerge, which is exactly the tension this question points at.
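A back-of-the-envelope calculation shows why step-by-step compounding is so damaging. The sketch below assumes each reasoning step succeeds independently with the same probability. That independence assumption is mine, not something the cited note states, and the function name is hypothetical.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step of a chain is correct, assuming independent steps."""
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps

# Illustrative numbers only: a model that is right 95% of the time per step.
for n in (1, 5, 10, 20):
    print(n, round(chain_success_probability(0.95, n), 3))
```

Under this assumption, 95% per-step accuracy leaves roughly a 60% chance of a fully correct 10-step chain and well under 40% at 20 steps, so high accuracy on short compositions says little about longer ones.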

The most useful reframing in the corpus is that the choice may be a false one. Networks appear to learn binding-compatible *geometry* without anyone building it in: LLMs encode syntactic relationships in something like polar coordinates, using both distance and angle to represent the type and direction of relations, a structured, symbolic-compatible scheme that arose on its own (see "How do language models encode syntactic relations geometrically?"). In the same spirit, length generalization transfers across related tasks because models reuse shared attention heads, a kind of computational scaffolding already present after pretraining (see "Can length generalization transfer between different related tasks?"). These findings suggest that scale doesn't skip binding; it grows implicit binding machinery.
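The distance-plus-angle idea can be illustrated with a toy calculation. This is not the Polar Probe itself. It assumes hypothetical 2-D points standing in for word embeddings and only shows that two relations can share the same distance while differing in angle, which is the extra information a distance-only readout would miss.

```python
import math

def polar_relation(head, dependent):
    """Return (distance, angle in degrees) of the vector from head to dependent."""
    dx = dependent[0] - head[0]
    dy = dependent[1] - head[1]
    return math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))

# Hypothetical 2-D positions, chosen by hand for illustration.
verb = (0.0, 0.0)
subject = (1.0, 0.0)
obj = (0.0, 1.0)

print(polar_relation(verb, subject))  # same distance as the next pair...
print(polar_relation(verb, obj))      # ...but a different angle
print(polar_relation(subject, verb))  # reversing direction changes the angle
```

A distance-only readout would treat all three pairs as identical. The angle is what separates the two relation types and the two directions in this sketch.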

There's a hard ceiling worth knowing about, though. On genuine constrained-optimization tasks, LLMs plateau at roughly 55–60% constraint satisfaction regardless of parameter count, architecture, or reasoning training, a wall that looks structural rather than like a scaling gap (see "Do larger language models solve constrained optimization better?"). That's the strongest evidence in the corpus against 'scaling alone is enough': when a task requires true systematic recombination under constraints, more scale doesn't move the number. One more wrinkle: architecture still shapes *how* composition happens. Deep-and-thin small models compose abstract concepts through stacked layers and beat wider, more balanced designs of the same size, suggesting depth is where compositional structure gets built (see "Does depth matter more than width for tiny language models?").

The honest synthesis: scaling can produce compositional generalization *within the convex hull of what training covered*, and it does grow implicit binding-like structure (modular subnetworks, polar syntactic geometry, reused heads) without anyone hand-coding it. But on truly novel compositions and constraint-heavy tasks, the binding problem reasserts itself as a ceiling. The thing you didn't know you wanted to know: the most predictive test of whether scaling 'worked' isn't accuracy — it's whether the constituent parts have become linearly decodable inside the network's activations.
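One way to separate "covered by training" from "truly novel" is a held-out-combination split: every primitive appears in training, but some pairings never do. The sketch below is a hypothetical helper, not a protocol taken from any of the cited notes, and the colour and shape vocabularies are placeholders.

```python
import itertools
import random

def compositional_split(parts_a, parts_b, holdout_fraction=0.3, seed=0):
    """Split all (a, b) combinations so held-out pairs are unseen but their parts are seen."""
    combos = list(itertools.product(parts_a, parts_b))
    # Pairs forced into training so every primitive is seen at least once.
    cover = {(a, parts_b[i % len(parts_b)]) for i, a in enumerate(parts_a)}
    cover |= {(parts_a[j % len(parts_a)], b) for j, b in enumerate(parts_b)}
    rest = [c for c in combos if c not in cover]
    random.Random(seed).shuffle(rest)
    k = min(int(len(combos) * holdout_fraction), len(rest))
    held_out = rest[:k]               # novel combinations of familiar parts
    train = sorted(cover) + rest[k:]  # everything else is available for training
    return train, held_out

colors = ["red", "green", "blue", "yellow"]
shapes = ["circle", "square", "triangle"]
train, held_out = compositional_split(colors, shapes)
print(len(train), "training combinations;", len(held_out), "held out")
```

Accuracy on `held_out` then measures recombination rather than recall. Raising `holdout_fraction` thins out training coverage, which is the condition the optimistic results depend on.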


Sources (8 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Can length generalization transfer between different related tasks?

Models trained jointly on related tasks reuse the same attention heads to handle length generalization, allowing shorter tasks to extrapolate beyond their training length. Pretrained models already contain this reusable computational scaffolding.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about compositional generalization in LLMs. The question: does scaling alone produce compositional generalization, or do networks require explicit binding mechanisms?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat as perishable checkpoints:
• Plain MLPs achieve compositional generalization with sufficient data diversity; success correlates with linear readability of constituent parts in hidden activations (~2023).
• Networks spontaneously form modular subnetworks isolating compositional task pieces; ablation shows clean functional segregation (~2023).
• Transformers often succeed via linearized subgraph matching, failing on novel combinations; errors compound step-by-step (~2023).
• LLMs encode syntactic relations using polar coordinates (distance + angle), a binding-compatible geometry that emerged without architectural priors (~2024–2025).
• LLMs plateau at 55–60% constraint satisfaction on genuine constrained-optimization tasks regardless of parameter count, architecture, or training regime, suggesting a structural ceiling (~2026).
• Depth (not width) drives compositional abstraction in sub-billion models; stacked layers build composition more effectively (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2012.05208 (2020): On the Binding Problem in Artificial Neural Networks
• arXiv:2305.18654 (2023): Faith and Fate: Limits of Transformers on Compositionality
• arXiv:2412.05571 (2024): A polar coordinate system represents syntax in large language models
• arXiv:2603.23004 (2026): Can Large Language Models Reason and Optimize Under Constraints?

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, Claude 4, Llama 3+), training methods (post-training, RL-fine-tuning), tooling (structured generation, constrained decoding), or evaluation (controlled compositionality benchmarks) have since relaxed or overturned it. Separate the durable question (likely still open: *what mechanism enables composition?*) from the perishable limitation (possibly resolved: *scale + method X now beats the 55–60% ceiling*). Cite what resolved it; state plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Does any recent paper show composition scales without the polar-coordinate geometry, or without modularity, or break the optimization plateau?
(3) Propose 2 research questions that ASSUME the regime may have shifted:
   – Question A: If polar-coordinate syntax and modular scaffolding are necessary *conditions* for compositional scaling, how do we test whether they are *sufficient*?
   – Question B: Given the 55–60% ceiling on constrained tasks, is the bottleneck representational (binding geometry insufficient), computational (search/inference bound), or architectural (no mechanism to enforce consistency across recombinations)?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
