Why do negative item weights matter more than model depth?
This explores a finding from collaborative filtering — that in recommendation models, letting items express *anti-affinity* (negative weights saying "people who like X tend not to like Y") does more work than stacking layers — and asks why structural bias beats raw model capacity.
The finding: a simple linear recommender that is allowed to encode which items *repel* each other can beat deep neural models with far more capacity. The headline result comes from ESLER (Can a linear model beat deep collaborative filtering?), a single-layer linear autoencoder. Its trick isn't depth; it's a constraint that forbids an item from predicting itself (zero diagonal), which forces every prediction to flow through item-to-item relationships. Crucially, the model learns *negative* weights: signals that one item's presence makes another less likely. Those negative weights carry information that capacity alone can't manufacture, and that's why the structural bias outperforms a deeper architecture.
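A minimal sketch of how a zero-diagonal linear autoencoder might be fit, and how negative weights can arise from it. The ridge-regularized closed form used here is an assumption (the note does not spell out the training procedure), and the function name `fit_zero_diag_linear`, the penalty `lam`, and the toy interaction matrix are all hypothetical.

```python
import numpy as np

def fit_zero_diag_linear(X, lam=0.5):
    """Fit item-item weights B minimizing ||X - X B||^2 + lam * ||B||^2
    subject to diag(B) = 0 (assumed ridge closed form, not taken from the note)."""
    X = np.asarray(X, dtype=float)
    gram = X.T @ X + lam * np.eye(X.shape[1])  # regularized item-item co-occurrence
    P = np.linalg.inv(gram)
    B = P / (-np.diag(P))      # column j is scaled by -1 / P[j, j]
    np.fill_diagonal(B, 0.0)   # an item may not predict itself
    return B

# Toy data (hypothetical): items 0 and 1 each co-occur with item 2,
# but never with each other.
X = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [0, 1, 1]])
B = fit_zero_diag_linear(X)
scores = X @ B  # predicted affinity of each user for each item
```

In this toy case `B[0, 1]` comes out negative (items 0 and 1 repel each other) while `B[0, 2]` comes out positive, so the anti-affinity can be read directly off the weight matrix without any hidden layers.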
The deeper point is that knowing what to suppress is often more valuable than knowing what to amplify, and this shows up across very different corners of the corpus. In reinforcement learning, training on *only* negative samples (suppressing wrong trajectories rather than reinforcing right ones) often matches full RL pipelines such as PPO and GRPO, because positive-only reinforcement collapses diversity by piling probability mass onto a few answers (Does negative reinforcement alone outperform full reinforcement learning?). The same logic recurs in ranking with multinomial likelihoods: the win comes from forcing items to *compete* for probability, so lifting one item necessarily pushes others down (Why does multinomial likelihood work better for ranking recommendations?). In all three cases, the modeling power lives in the negative space.
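The competition for probability under a multinomial likelihood can be illustrated with a softmax over item scores. This is a sketch under stated assumptions: the helper names, the three-item logits, and the size of the boost are hypothetical, and the log-likelihood omits the constant multinomial coefficient.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax over items."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def multinomial_log_likelihood(logits, clicks):
    """Sum of click counts times log item probabilities (constant term dropped)."""
    return float(np.sum(clicks * log_softmax(logits)))

logits = np.array([1.0, 1.0, 1.0])            # hypothetical scores for three items
boosted = logits + np.array([2.0, 0.0, 0.0])  # raise item 0 only

p_before = np.exp(log_softmax(logits))
p_after = np.exp(log_softmax(boosted))
```

Because the probabilities must sum to one, raising item 0's score lowers the probability of items 1 and 2 even though their own scores did not change. A likelihood that scores each item independently has no such coupling.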
This cuts against the instinct that more depth equals more performance, an instinct that *is* sometimes right. For sub-billion-parameter language models, deep-and-thin beats wide-and-shallow precisely because layers compose abstract concepts (Does depth matter more than width for tiny language models?). So depth isn't useless; the real lesson is that depth pays off only when the problem genuinely needs hierarchical composition. Collaborative filtering mostly doesn't: what it needs is an accurate map of which items pull together and which push apart, and a linear model with the right constraint captures that directly while a deep model spends its capacity rediscovering it.
There's also a quieter reason structural bias wins: it can correct distortions in the data that more capacity would simply memorize. Recommendation data is poisoned by feedback loops, where a system's past choices shape what it's later trained on; YouTube's ranker needed an explicit position tower to strip out that selection bias, because without the structural correction the model converged on degenerate equilibria that amplified its own history (Why do ranking systems need to model selection bias explicitly?). Relatedly, the *shape* of the training objective can either build good representations or quietly erode them: utility-weighted loss sharpens decisions while starving the feature-learning that capacity is supposed to provide (Can utility-weighted training loss actually harm model performance?).
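One way to picture the position-tower correction is as a separate bias term that absorbs the effect of display position during training and is dropped when scoring. The sketch below assumes an additive-logit form; that form, the function names, and the bias values are hypothetical illustrations, not details given in the note.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned values: one bias logit per display position (top slot = 0).
position_bias = np.array([1.5, 0.5, -0.5, -1.5])

def training_click_prob(relevance_logit, position):
    """During training, the position logit is added to the main model's logit."""
    return sigmoid(relevance_logit + position_bias[position])

def serving_score(relevance_logit):
    """At serving time the position term is dropped, leaving relevance only."""
    return sigmoid(relevance_logit)

same_item_top = training_click_prob(0.2, position=0)
same_item_low = training_click_prob(0.2, position=3)
```

The same item is predicted to be clicked more often in the top slot than in the bottom one, but that difference is attributed to `position_bias` rather than to the item's relevance, so it does not feed back into the ranking.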
The thing worth carrying away: across recommenders, RL, and ranking, the corpus keeps finding that *the right inductive bias is a substitute for parameters, not a complement to them.* A constraint that encodes anti-affinity, a rule that suppresses bad trajectories, a likelihood that makes items compete — each injects structure the model would otherwise have to learn the hard way, and often learn worse. Depth is what you reach for when you've run out of structure to impose.
Sources (6 notes)
ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.
Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.