INQUIRING LINE

When AI combines ideas it's never seen before, is it reasoning from rules — or just filling gaps in a dense training set?

What role does query-level exposure play in enabling compositional generalization?

This explores whether compositional generalization (recombining known pieces into novel wholes) actually depends on having seen the relevant combinations during training, rather than on having learned genuine systematic rules.

This line reads 'query-level exposure' as the degree to which a model has actually encountered specific combinations of task pieces during training, and the corpus suggests this exposure does far more work than the word 'generalization' implies. The most direct evidence is that standard MLPs achieve compositional generalization through data and model scaling alone, with no special architecture, but only when the training distribution sufficiently covers the combinations of task modules in question (Can neural networks learn compositional skills without symbolic mechanisms?). In other words, what looks like the ability to compose is often the ability to interpolate over a well-covered space. Exposure isn't a helper; it's the precondition.
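To make "coverage of the combination space" concrete, here is a minimal sketch. It assumes a task is a tuple of module names, one per slot, and that the training set is a list of such tuples. The function names and the toy modules are hypothetical illustrations, not taken from the cited note.

```python
from itertools import product


def combination_coverage(train_combos, modules_per_slot):
    """Fraction of all possible module combinations that appear in training."""
    all_combos = set(product(*modules_per_slot))
    seen = set(train_combos) & all_combos
    return len(seen) / len(all_combos)


def is_covered(query, train_combos):
    """True if this exact combination of modules appeared in training."""
    return tuple(query) in set(train_combos)


# Hypothetical toy setup: two slots with three modules each, so 9 combinations.
modules_per_slot = [("rotate", "scale", "flip"), ("red", "green", "blue")]
train_combos = [
    ("rotate", "red"), ("rotate", "green"), ("scale", "red"),
    ("flip", "blue"), ("scale", "blue"), ("flip", "green"),
]

coverage = combination_coverage(train_combos, modules_per_slot)
novel_query = ("scale", "green")  # both pieces were seen, but never together
print(coverage, is_covered(novel_query, train_combos))
```

Here six of the nine combinations are covered, and the query ("scale", "green") is made of familiar pieces in an unseen pairing, which is the kind of case the exposure question turns on.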

The sharpest contrast comes from work showing what happens when exposure runs out. Transformers don't appear to learn systematic rules at all: they reduce compositional reasoning to matching against linearized computation subgraphs memorized from training, succeeding in-distribution and failing drastically on genuinely novel compositions, with errors compounding across steps (Do transformers actually learn systematic compositional reasoning?). Put the two notes side by side and a picture emerges: the model isn't reasoning compositionally so much as recognizing whether a given query falls inside or outside the territory it was exposed to. The boundary of exposure is the boundary of competence.
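A rough way to see why errors compounding across steps matters: if each step of a multi-step composition is correct with some fixed probability, and steps fail independently, the chance of getting the whole chain right shrinks geometrically with its length. Both the independence assumption and the 0.95 per-step accuracy below are illustrative assumptions, not figures from the cited note.

```python
def chain_success_probability(per_step_accuracy, num_steps):
    """Probability that every step in a chain is correct, assuming
    independent steps with the same accuracy (a simplifying assumption)."""
    return per_step_accuracy ** num_steps


# Hypothetical per-step accuracy of 0.95 over chains of increasing length.
probs = {n: chain_success_probability(0.95, n) for n in (1, 5, 10, 20)}
for steps, p in probs.items():
    print(f"{steps:2d} steps: {p:.3f}")
```

Under these assumptions a 10-step chain succeeds only about 60% of the time, even though each individual step looks reliable.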

What makes exposure usable, rather than just memorized, seems to be internal structure. Networks naturally carve compositional tasks into isolated modular subnetworks (ablate one and only its corresponding function degrades), and pretraining sharpens this modularity, making it more consistent across architectures (Do neural networks naturally learn modular compositional structure?). The scaling note adds a tell: success is predicted by whether the constituent pieces are linearly decodable from hidden activations (Can neural networks learn compositional skills without symbolic mechanisms?). So exposure works when it leaves behind cleanly separable parts that can be re-bound, and the failure to reuse learned structure in novel combinations is one of the three failures the binding-problem note places behind systematic generalization failure (Why do neural networks fail at compositional generalization?).
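"Linearly decodable" can be illustrated with a linear probe: fit a linear classifier on hidden activations and check whether it recovers a constituent's label. The sketch below uses a plain perceptron on hand-made 2-D "activations"; the data, the function names, and the choice of a perceptron are all assumptions made for illustration, and a real probe would be scored on held-out activations rather than on the data it was fitted to.

```python
def linear_predict(x, weights, bias):
    """Binary prediction from a linear score."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_perceptron(activations, labels, epochs=50):
    """Fit a linear probe with the perceptron rule; labels are 0 or 1."""
    weights, bias = [0.0] * len(activations[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            error = y - linear_predict(x, weights, bias)
            if error:
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
    return weights, bias


def probe_accuracy(activations, labels):
    """Accuracy of a fitted linear probe on the same data (toy check only)."""
    weights, bias = train_perceptron(activations, labels)
    hits = sum(linear_predict(x, weights, bias) == y
               for x, y in zip(activations, labels))
    return hits / len(labels)


# Hypothetical activations: four points in a 2-D hidden space.
acts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
separable_labels = [0, 0, 1, 1]  # constituent encoded along the first axis
xor_labels = [0, 1, 1, 0]        # constituent entangled across both axes

acc_separable = probe_accuracy(acts, separable_labels)
acc_xor = probe_accuracy(acts, xor_labels)
print(acc_separable, acc_xor)
```

The first labeling is recovered perfectly by a linear readout, while the XOR-style labeling cannot be, no matter how long the probe trains. That is the distinction the decodability signal is drawing.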

The surprising twist is how little exposure it can take. Keyword priming after a gradient update is predictable from a token's pre-learning probability, with a sharp threshold (around 10^-3) separating contexts where priming occurs from those where it doesn't, and just three training exposures suffice to establish the effect (Can we predict keyword priming before learning happens?). That suggests 'query-level exposure' may behave more like a switch than a dosage: a few hits in the right place could move a combination from the fail side of the line to the pass side. But exposure cuts both ways: training a dense retriever to be more sensitive to compositional structure reliably degrades its zero-shot generalization, an 8–40% drop in nDCG@10 that's a geometric trade-off in the embedding space rather than a tuning bug (Does training for compositional sensitivity hurt dense retrieval?). Optimizing hard for compositional discrimination can cost you the broad coverage that made generalization possible in the first place.
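The retrieval note behind the 8–40% figure reports it as a drop in nDCG@10. For readers unfamiliar with the metric, here is a minimal sketch of one common formulation (linear gain, log2 rank discount). The function names are hypothetical, and the ideal ranking is computed from the same judged list, which is a simplification; the cited work may use a different variant.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each retrieved document, in ranked order (hypothetical data).
perfect_ranking = [1, 0, 0]
relevant_doc_second = [0, 1, 0]
print(ndcg_at_k(perfect_ranking), ndcg_at_k(relevant_doc_second))
```

Pushing the single relevant document from rank 1 to rank 2 already lowers the score from 1.0 to roughly 0.63, which gives a sense of how ranking shifts translate into the metric.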

The thing worth carrying away: across these notes, 'compositional generalization' looks less like a capability the model has and more like a reflection of where its training exposure reached. Coverage of the combination space, clean modular decomposition of what was covered, and a low exposure threshold to flip a query into the 'seen enough' regime do most of the explaining — which is why scaling the data can substitute for clever architecture, and why the real frontier is the queries that sit just past the edge of what was ever shown.

Sources (6 notes)

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Does training for compositional sensitivity hurt dense retrieval?

Adding structure-targeted negatives to dense retrieval training consistently degrades zero-shot performance (8-40% nDCG@10 drop) while only partially improving compositional discrimination. This is a geometric trade-off in high-dimensional cosine spaces, not a tuning problem.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Scaling can lead to compositional generalization (3.59 match)
Break It Down: Evidence for Structural Compositionality in Neural Networks (3.56 match)
How do Transformers Learn Implicit Reasoning? (3.30 match)
From Frege to chatGPT: Compositionality in language, cognition, and deep neural networks (2.60 match)
Faith and Fate: Limits of Transformers on Compositionality (2.58 match)
How new data permeates LLM knowledge and how to dilute it (1.68 match)
Bigger is not always better: The importance of human-scale language modeling for psycholinguistics (1.62 match)
Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models (1.60 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about compositional generalization in LLMs. The central question: does query-level exposure (training coverage of task-piece combinations) truly limit compositional generalization, or have newer models, training methods, or evaluation regimes since relaxed this constraint?

What a curated library found — and when (findings span 2016–2026, dated claims not current truth):
• Compositional generalization emerges from data and model scaling alone with no special architecture, conditional on the training distribution covering constituent task combinations; success is predicted by linear decodability of constituents from hidden activations (~2025).
• Transformers reduce compositional reasoning to memorized linearized computation subgraphs, failing drastically on genuinely novel compositions (~2023).
• Networks naturally decompose compositional tasks into modular subnetworks, and pretraining sharpens that modularity (~2023).
• Keyword priming after learning is predicted by a token's pre-learning probability, with a sharp ~10^-3 threshold; just three training exposures suffice to establish the effect (~2024).
• Training dense retrievers for compositional sensitivity causes an 8–40% zero-shot nDCG@10 drop, a geometric trade-off in embedding space (~2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.18654 (Faith and Fate: Limits of Transformers on Compositionality, 2023)
• arXiv:2507.07207 (Scaling can lead to compositional generalization, 2025)
• arXiv:2604.16351 (Training for Compositional Sensitivity Reduces Dense Retrieval Generalization, 2026)
• arXiv:2012.05208 (On the Binding Problem in Artificial Neural Networks, 2020)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, Claude 3.5+, Llama 3.x), training methods (RL-fine-tuning, scaffold learning, multi-modal pretraining), tooling (code interpreters, dynamic retrieval), orchestration (multi-turn context, tool-use chains), or evaluation have since relaxed or overturned it. Separate the durable question (likely still open) from the perishable limitation. Cite what resolved it; say plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially papers claiming compositional generalization *without* full exposure, or showing emergence of novel composition from in-context learning alone.
(3) Propose two research questions that *assume* the exposure-as-boundary regime may have shifted: e.g., "Can instruction-tuning or prompt-based scaffolding decouple compositional success from pre-training coverage?" or "Does multi-agent orchestration (e.g., tool delegation) allow a single LLM to compose without memorizing the combination space?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
