What role does query-level exposure play in enabling compositional generalization?
This note explores whether compositional generalization (recombining known pieces into novel wholes) depends on having seen the relevant combinations during training rather than on learning genuine systematic rules.
Here 'query-level exposure' means the degree to which a model has encountered specific combinations of task pieces during training, and the corpus suggests this exposure does far more work than the word 'generalization' implies. The most direct evidence is that standard MLPs achieve compositional generalization through data and model scaling alone, with no architectural modifications, but only when the training distribution sufficiently covers the combinations of task modules in question [Can neural networks learn compositional skills without symbolic mechanisms?]. In other words, what looks like the ability to compose is often the ability to interpolate over a well-covered space. Exposure isn't a helper; it's the precondition.
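The coverage condition can be made concrete with a small sketch. It assumes, purely for illustration, that each training example can be labelled with the tuple of task modules it composes; the function names and the toy modules are hypothetical and do not come from the cited note.

```python
from itertools import product

def combination_coverage(train_examples, modules_per_slot):
    """Fraction of all possible module combinations seen at least once in training."""
    all_combos = set(product(*modules_per_slot))
    seen = {tuple(ex) for ex in train_examples} & all_combos
    return len(seen) / len(all_combos)

def is_exposed(query, train_examples):
    """True if this exact combination of modules appeared in training."""
    return tuple(query) in {tuple(ex) for ex in train_examples}

# Toy setup (hypothetical): two slots, two modules each, so four combinations.
modules_per_slot = [["rotate", "flip"], ["red", "blue"]]
train_examples = [("rotate", "red"), ("flip", "blue"), ("rotate", "red")]

coverage = combination_coverage(train_examples, modules_per_slot)
novel_query_seen = is_exposed(("rotate", "blue"), train_examples)
```

In this toy case half of the combination space is covered, and the query `("rotate", "blue")` falls outside it even though both of its pieces were seen individually.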
The sharpest contrast comes from work showing what happens when exposure runs out. Transformers don't appear to learn systematic rules at all: they reduce compositional reasoning to matching against linearized computation subgraphs memorized from training, succeeding in-distribution and failing drastically on genuinely novel compositions, with errors compounding across steps [Do transformers actually learn systematic compositional reasoning?]. Put the two notes side by side and a picture emerges: the model isn't reasoning compositionally so much as recognizing whether a given query falls inside or outside the territory it was exposed to. The boundary of exposure is the boundary of competence.
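A rough back-of-the-envelope model shows why compounding errors are so punishing. It assumes, as a simplification not taken from the cited note, that each reasoning step succeeds independently with the same probability; `chain_success` is a hypothetical helper name.

```python
def chain_success(per_step_accuracy: float, n_steps: int) -> float:
    """Probability that every step in an n-step chain is correct,
    assuming independent steps with identical accuracy."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return per_step_accuracy ** n_steps

# How end-to-end accuracy decays as the chain gets longer.
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
```

Under this assumption, a model that is 95% accurate per step completes a 10-step chain correctly only about 60% of the time, and a 20-step chain about 36% of the time.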
What makes exposure usable, rather than just memorized, seems to be internal structure. Networks naturally carve compositional tasks into isolated modular subnetworks (ablate one and only its corresponding function degrades), and pretraining sharpens this modularity, making it more consistent across architectures and domains [Do neural networks naturally learn modular compositional structure?]. The scaling note adds a tell: success is predicted by whether the constituent pieces are linearly decodable from hidden activations [Can neural networks learn compositional skills without symbolic mechanisms?]. So exposure works when it leaves behind cleanly separable parts that can be re-bound. The failure to reuse learned structure in novel combinations is one of the three failures that the binding-problem account (Greff et al.) places behind systematic generalization failure, alongside failing to segregate entities from inputs and failing to keep their representations separate [Why do neural networks fail at compositional generalization?].
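Linear decodability is usually tested with a linear probe. The sketch below is one minimal way to do that, assuming hidden activations are available as a NumPy array with one row per example and that each row has an integer label for the constituent of interest. The least-squares probe, the function name, and the synthetic activations are illustrative choices, not the cited note's method.

```python
import numpy as np

def linear_probe_accuracy(hidden, labels, train_frac=0.8, seed=0):
    """Fit a least-squares linear probe on a train split; return held-out accuracy."""
    rng = np.random.default_rng(seed)
    n = hidden.shape[0]
    order = rng.permutation(n)
    cut = int(train_frac * n)
    train_idx, test_idx = order[:cut], order[cut:]

    n_classes = int(labels.max()) + 1
    targets = np.eye(n_classes)[labels]          # one-hot targets
    X = np.hstack([hidden, np.ones((n, 1))])     # append a bias column

    # Solve min ||X W - targets||^2 on the training split only.
    W, *_ = np.linalg.lstsq(X[train_idx], targets[train_idx], rcond=None)
    preds = (X[test_idx] @ W).argmax(axis=1)
    return float((preds == labels[test_idx]).mean())

# Synthetic stand-ins for real activations (hypothetical data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=600)
directions = rng.normal(size=(3, 32))
# Constituent is linearly encoded: class direction plus small noise.
decodable = directions[labels] + 0.1 * rng.normal(size=(600, 32))
# Constituent is absent: activations are unrelated to the label.
scrambled = rng.normal(size=(600, 32))

acc_decodable = linear_probe_accuracy(decodable, labels)
acc_scrambled = linear_probe_accuracy(scrambled, labels)
```

On the synthetic data, the probe should recover the label almost perfectly from `decodable` and sit near chance (about one in three) on `scrambled`. With real activations, a large gap of this kind would be the signal the note describes.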
The surprising twist is how little exposure it can take. Keyword priming after a gradient update is predictable from a token's pre-learning probability, with a sharp threshold of roughly 10^-3 separating contexts where priming occurs from those where it doesn't, and just three training exposures suffice to establish the effect [Can we predict keyword priming before learning happens?]. That reframes 'query-level exposure' as something closer to a switch than a dosage: a few hits in the right place can move a combination from the fail side of the line to the pass side. But exposure cuts both ways. Training a dense retriever with structure-targeted negatives to make it more sensitive to compositional structure consistently degrades its zero-shot performance, an 8–40% drop in nDCG@10, while only partially improving compositional discrimination; the note attributes this to a geometric trade-off in the embedding space rather than a tuning bug [Does training for compositional sensitivity hurt dense retrieval?]. Optimizing hard for compositional discrimination can cost you the broad coverage that made generalization possible in the first place.
The thing worth carrying away: across these notes, 'compositional generalization' looks less like a capability the model has and more like a reflection of where its training exposure reached. Coverage of the combination space, clean modular decomposition of what was covered, and a low exposure threshold to flip a query into the 'seen enough' regime do most of the explaining — which is why scaling the data can substitute for clever architecture, and why the real frontier is the queries that sit just past the edge of what was ever shown.
Sources (6 notes)
Can neural networks learn compositional skills without symbolic mechanisms? Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Do transformers actually learn systematic compositional reasoning? Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Do neural networks naturally learn modular compositional structure? Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Why do neural networks fail at compositional generalization? Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.
Can we predict keyword priming before learning happens? Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Does training for compositional sensitivity hurt dense retrieval? Adding structure-targeted negatives to dense retrieval training consistently degrades zero-shot performance (8–40% nDCG@10 drop) while only partially improving compositional discrimination. This is a geometric trade-off in high-dimensional cosine spaces, not a tuning problem.