INQUIRING LINE

Why do cross-product features fail to generalize across unseen feature combinations?

This explores why hand-crafted feature crosses (the "wide" part of recommender models) memorize what they've seen but break on combinations they haven't — and what the corpus says about the deeper reason: memorization and generalization are structurally different jobs.


The cleanest answer in the corpus is that a cross-product feature is not a generalizer with a bug in it: it *is* memorization by construction. The Wide & Deep work (Can one model memorize and generalize better than two?) makes this explicit: a cross-product term like "installed app A AND impression of app B" fires only when both conditions hold, and it carries a learned weight only for pairs the model has actually observed. It's essentially a giant lookup table. When a new pair shows up, there's no entry, so the wide tower contributes nothing for that pair. That's not a failure of training; it's the inductive bias of the feature itself. The whole reason these models pair a wide tower with a deep embedding tower (Can one model handle both memorization and generalization?) is that embeddings can interpolate to unseen combinations by placing items in a shared continuous space, while the cross-products patch the rare, specific cases embeddings smooth over. Each half exists precisely because the other can't do its job.
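The lookup-table behavior can be sketched in a few lines. This is a minimal illustration, not the Wide & Deep implementation: the training log, the `cross_key` helper, the learning rate, and the toy update rule are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical training log: (installed_app, impressed_app, clicked).
train = [
    ("netflix", "hulu", 1),
    ("netflix", "hulu", 1),
    ("netflix", "spotify", 0),
    ("maps", "uber", 1),
]

# Wide part: one weight per observed cross-product key.
# Weights start at 0.0, so a key never seen in training stays at 0.0.
wide_weights = defaultdict(float)

def cross_key(installed, impressed):
    # Hypothetical key format for the crossed feature.
    return f"installed={installed}&impression={impressed}"

# Toy update rule (an assumption): nudge each observed key toward its label.
lr = 0.5
for installed, impressed, label in train:
    key = cross_key(installed, impressed)
    target = 1.0 if label else -1.0
    wide_weights[key] += lr * (target - wide_weights[key])

def wide_score(installed, impressed):
    # .get does not create entries, so unseen pairs stay absent from the table.
    return wide_weights.get(cross_key(installed, impressed), 0.0)

seen = wide_score("netflix", "hulu")    # 0.75: the pair was observed twice
unseen = wide_score("maps", "hulu")     # 0.0: no entry, no contribution
```

Both `maps` and `hulu` appear in the log, just never together, and that is enough for the crossed feature to stay silent.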

What makes this an *Inquiring Line* rather than a recommender-systems footnote is that the same failure appears under completely different names across the collection. The binding problem (Why do neural networks fail at compositional generalization?) is the general version: neural networks struggle to recombine learned pieces into novel arrangements because they don't cleanly segregate and re-bind components. A cross-product feature is the extreme, brittle endpoint of that: it doesn't even attempt to decompose the combination into reusable parts, so it can never recompose them. Generalization to unseen combinations requires representing A and B *separately* and composing them on the fly; memorizing "A×B" as one atom forecloses that entirely.
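By contrast, a factorized representation keeps A and B separate and composes them at scoring time. The sketch below assumes hand-picked 2-d vectors (not learned ones) and hypothetical names such as `context_vecs` and `factorized_score`; it only illustrates that a pair absent from training still receives a meaningful score.

```python
# Hypothetical 2-d embeddings, hand-picked for illustration (not learned).
context_vecs = {"netflix": [1.0, 0.0], "maps": [0.0, 1.0], "disney": [0.9, 0.1]}
item_vecs = {"hulu": [1.0, 0.0], "uber": [0.0, 1.0]}

# Pairs assumed to have appeared in training.
observed_pairs = {("netflix", "hulu"), ("maps", "uber")}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def factorized_score(installed, impressed):
    # A and B are represented separately and composed on the fly.
    return dot(context_vecs[installed], item_vecs[impressed])

# ("disney", "hulu") was never observed, yet it scores close to ("netflix", "hulu")
# because "disney" sits near "netflix" in the shared space.
unseen_close = factorized_score("disney", "hulu")
unseen_far = factorized_score("disney", "uber")
```

The score for the unseen pair comes entirely from where each part sits in the shared space, which is the interpolation a single memorized "A×B" atom cannot provide.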

The corpus also explains *why the smooth, learned alternative isn't a free win*. You might expect a flexible neural similarity to just learn the right combinations, but the dot-product-vs-MLP results (Can MLPs learn to match dot product similarity in practice?; Why does dot product beat MLP-based similarity in practice?) show that expressiveness isn't the bottleneck; inductive bias is. An MLP that is theoretically capable of learning a dot product still needs enormous data and capacity to approximate it, and even then its scores can't be served through efficient maximum-inner-product retrieval at scale. The lesson cuts both ways: the right *structure* (geometric interpolation) generalizes cheaply, while raw flexibility generalizes expensively or not at all. Cross-products sit at the opposite pole: maximal structure, zero interpolation.
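The retrieval point can be made concrete with a small sketch. Everything here is an assumption for illustration: the item vectors, the fixed MLP weights `W1` and `w2`, and the brute-force search that stands in for a real MIPS index.

```python
import heapq

# Hypothetical 2-d item embeddings, hand-picked for illustration.
item_vecs = {"hulu": [1.0, 0.0], "uber": [0.0, 1.0], "spotify": [0.7, 0.3]}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_k_dot(user_vec, k):
    # Brute-force maximum inner product search over a fixed item index.
    # The item vectors do not depend on the user, so they can be indexed once.
    return heapq.nlargest(k, item_vecs, key=lambda name: dot(user_vec, item_vecs[name]))

# Hypothetical fixed weights for a one-hidden-layer scorer (4 inputs, 2 hidden units).
W1 = [[0.5, -0.2, 0.3, 0.1], [-0.4, 0.6, 0.2, -0.3]]
w2 = [1.0, -1.0]

def mlp_score(user_vec, item_vec):
    x = user_vec + item_vec                          # concatenate the two vectors
    hidden = [max(0.0, dot(row, x)) for row in W1]   # ReLU hidden layer
    return dot(w2, hidden)

def top_k_mlp(user_vec, k):
    # One forward pass per (user, item) pair; nothing here reduces to an
    # inner product against a precomputed item index.
    return heapq.nlargest(k, item_vecs, key=lambda name: mlp_score(user_vec, item_vecs[name]))

user = [1.0, 0.0]
dot_ranking = top_k_dot(user, 2)   # ['hulu', 'spotify']
mlp_ranking = top_k_mlp(user, 2)
```

With three items the two rankings cost about the same. The difference matters at catalog scale, where the dot-product form can be handed to an approximate inner-product index and the MLP form has to score every candidate pair.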

There's a subtler thread worth pulling. Even when a model scores perfectly, its internal organization can be "fractured" (Can models be smart without organized internal structure?): all the needed features are linearly decodable, but they are glued together in a way that shatters under distribution shift. Unseen feature combinations *are* a distribution shift, and that's exactly where this fragility surfaces. The same brittleness shows up in reasoning: chain-of-thought degrades predictably the moment you move off the training distribution (Does chain-of-thought reasoning actually generalize beyond training data?), imitating the form of reasoning without the substance. The pattern repeats: what looks like competence is often memorized coverage of seen cases.

So the thing you might not have known you wanted to know: the cure for cross-product brittleness isn't a better cross-product. It's deciding which combinations genuinely need memorizing (rare, high-value, idiosyncratic pairs) and letting a structured continuous representation handle the rest — which is exactly the division of labor Wide & Deep formalized, and the binding problem explains in principle.
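That division of labor can be sketched as a single logit that sums a memorized term and a structured term. This is a simplification under stated assumptions: the weights, embeddings, and bias are invented, and the deep part is reduced to a dot product here, whereas the Wide & Deep paper uses an MLP over concatenated embeddings.

```python
import math

# Hypothetical values for illustration only.
wide_weights = {("netflix", "hulu"): 1.2}   # one memorized, high-value pair
emb = {"netflix": [1.0, 0.0], "disney": [0.9, 0.1], "hulu": [1.0, 0.0]}
bias = -0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(installed, impressed):
    # Wide term: nonzero only for pairs that were explicitly memorized.
    wide = wide_weights.get((installed, impressed), 0.0)
    # Deep term (simplified to a dot product): defined for every pair.
    deep = sum(a * b for a, b in zip(emb[installed], emb[impressed]))
    return sigmoid(wide + deep + bias)

memorized = predict("netflix", "hulu")   # wide and deep both contribute
unseen = predict("disney", "hulu")       # deep alone carries the score
```

The memorized pair gets a boost from its table entry, while the unseen pair still receives a sensible probability from the continuous representation alone.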


Sources (7 notes)

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.

Can one model handle both memorization and generalization?

Wide & Deep architectures train a sparse cross-product tower and a dense embedding tower together, allowing the wide part to patch only the deep part's weaknesses. This joint approach requires smaller models than ensemble methods.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and they cannot be efficiently retrieved at scale—proving inductive bias matters more than expressiveness.

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a machine learning researcher auditing claims about cross-product feature generalization. The question remains open: *Why do cross-product features fail on unseen feature combinations, and what structural alternatives actually work?*

What a curated library found — and when (findings span 2016–2026; treat as dated claims, not current truth):
• Cross-products are memorization by construction: "A AND B" only fires for observed pairs; embeddings interpolate via continuous space (Wide & Deep, 2016).
• The binding problem explains the general failure: neural networks struggle to segregate and re-bind learned components into novel arrangements (2020).
• MLP-based similarity underperforms dot-product despite being a universal approximator — inductive bias (geometric structure) matters more than raw expressiveness (2020).
• Distribution shift (e.g., unseen feature combinations) fractures even high-performing models whose internal representations are poorly organized (2024).
• Chain-of-thought reasoning degrades predictably off-distribution, imitating competence without substance (2025).

Anchor papers (verify; mind their dates):
• arXiv:1606.07792 — Wide & Deep Learning for Recommender Systems (2016)
• arXiv:2012.05208 — On the Binding Problem in Artificial Neural Networks (2020)
• arXiv:2508.01191 — Is Chain-of-Thought Reasoning of LLMs a Mirage? (2025)
• arXiv:2507.07207 — Scaling can lead to compositional generalization (2025)

Your task:
(1) RE-TEST each constraint. For memorization-by-construction and the binding problem: have sparse autoencoders (2024) or post-training RL (2025) relaxed the segregation or re-binding bottleneck? Can modern scaling (2025) or RL fine-tuning compose unseen feature interactions? Separate the durable question (how to represent novel combinations) from perishable limits (old model capacity, training harnesses).
(2) Surface work contradicting the "structure beats flexibility" claim — especially if recent scaling or orchestration (multi-agent, caching, retrieval) has made learned similarity competitive with geometric structure.
(3) Propose two questions assuming the regime has moved: (a) Does compositional scaling (2025) dissolve the need for explicit wide towers? (b) Can retrieval-augmented or RL-tuned models route unseen combinations to learned decompositions dynamically?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
