INQUIRING LINE

Can linear bandit methods scale beyond their original reward assumptions?

This explores whether linear contextual bandits—algorithms like LinUCB that assume reward is a linear function of context features—still hold up when you push past that linearity assumption, scale them to huge recommendation systems, or relax their exploration machinery.


Linear contextual bandits were built on a comfortable assumption: that reward is a tidy linear function of the context you observe. The corpus suggests they can survive outside that comfort zone, with a qualification: the linear core stays useful, but each direction you push it (richer reward shapes, larger scale, cheaper exploration) trades one assumption for a different one rather than getting something for free.

The linear starting point is well established: framing news recommendation as a contextual bandit with LinUCB lets a system explicitly weigh trying uncertain articles against exploiting proven ones, and it beats collaborative filtering on dynamic content with provable regret bounds and low overhead (see "Can bandit algorithms beat collaborative filtering for news?"). The first way to scale beyond pure linearity is to keep the linear reward idea but make the coefficients personal: PReF learns base reward functions, then treats each user as a linear combination of them, recovering someone's preferences from about ten well-chosen questions (see "Can user preferences be learned from just ten questions?"). That's still linear in spirit, but the linearity is now in a learned feature basis, which is a meaningful loosening of the original raw-feature assumption.
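To make the linear starting point concrete, here is a minimal sketch of the per-arm ("disjoint") form of LinUCB: each arm keeps a ridge-regression estimate and adds a bonus proportional to how uncertain that estimate is for the current context. The class name, the `alpha` value, the two-feature contexts, and the simulated ground truth in `run_linucb` are all illustrative assumptions, not details taken from the notes.

```python
import math
import random


class LinUCBArm:
    """One arm of disjoint LinUCB: ridge regression plus an uncertainty bonus."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # exploration weight (illustrative value)
        # Inverse of A = I + sum(x x^T); starts as the identity matrix.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def ucb(self, x):
        """Estimated reward for context x plus alpha * confidence width."""
        A_inv_x = self._matvec(self.A_inv, x)
        theta = self._matvec(self.A_inv, self.b)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * ai for xi, ai in zip(x, A_inv_x)), 0.0))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of A_inv for A <- A + x x^T."""
        A_inv_x = self._matvec(self.A_inv, x)
        denom = 1.0 + sum(xi * ai for xi, ai in zip(x, A_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= A_inv_x[i] * A_inv_x[j] / denom
            self.b[i] += reward * x[i]


def run_linucb(n_rounds=2000, seed=0):
    """Simulate two arms with hypothetical linear rewards; return best-arm rate."""
    rng = random.Random(seed)
    true_theta = [[0.8, 0.1], [0.1, 0.8]]  # made-up ground truth per arm
    arms = [LinUCBArm(dim=2) for _ in true_theta]
    best_picks = 0
    for _ in range(n_rounds):
        x = [rng.random(), rng.random()]
        expected = [sum(t * xi for t, xi in zip(th, x)) for th in true_theta]
        choice = max(range(len(arms)), key=lambda a: arms[a].ucb(x))
        reward = expected[choice] + rng.gauss(0.0, 0.1)
        arms[choice].update(x, reward)
        best_picks += int(expected[choice] == max(expected))
    return best_picks / n_rounds
```

The bonus term shrinks as an arm accumulates observations in a given direction of context space, so exploration fades on its own where the linear model is already well determined.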

The sharper break is replacing the linear model entirely with a neural one while keeping the bandit's exploration logic. Epistemic neural networks do exactly this: they separate irreducible noise from genuine model uncertainty so Thompson sampling can run at recommendation scale, lifting click-through 9% and ratings 6% while needing 29% fewer interactions (see "Can neural networks explore efficiently at recommendation scale?"). So the reward function no longer has to be linear at all, but you inherit a new burden: correctly estimating where the network is uncertain, which is the hard part the linear case gave you almost for free.
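The split between irreducible noise and model uncertainty can be illustrated without any neural network. The sketch below swaps in a simple conjugate Gaussian model as a stand-in: it is not the epistemic neural network from the cited note, only a toy in which the two kinds of uncertainty are tracked separately and Thompson sampling draws from the model uncertainty alone. The arm means, the known-noise assumption, and all names are hypothetical.

```python
import math
import random


class GaussianArm:
    """Posterior over an arm's mean reward, with known observation noise."""

    def __init__(self, noise_var=1.0, prior_var=1.0):
        self.noise_var = noise_var         # aleatoric: irreducible, assumed known
        self.precision = 1.0 / prior_var   # posterior precision of the mean (prior mean 0)
        self.weighted_sum = 0.0            # sum of reward / noise_var

    @property
    def mean(self):
        return self.weighted_sum / self.precision

    @property
    def epistemic_var(self):
        # Uncertainty about the mean itself; shrinks with every observation.
        return 1.0 / self.precision

    def sample_mean(self, rng):
        # Thompson sampling: draw a plausible mean, ignoring observation noise.
        return rng.gauss(self.mean, math.sqrt(self.epistemic_var))

    def update(self, reward):
        self.precision += 1.0 / self.noise_var
        self.weighted_sum += reward / self.noise_var


def run_thompson(true_means=(0.0, 1.0), n_rounds=3000, seed=0):
    """Play Thompson sampling on hypothetical Gaussian arms; return arms and pull counts."""
    rng = random.Random(seed)
    arms = [GaussianArm() for _ in true_means]
    pulls = [0] * len(true_means)
    for _ in range(n_rounds):
        choice = max(range(len(arms)), key=lambda a: arms[a].sample_mean(rng))
        reward = rng.gauss(true_means[choice], 1.0)  # noise std matches noise_var=1.0
        arms[choice].update(reward)
        pulls[choice] += 1
    return arms, pulls
```

Note that `noise_var` never changes while `epistemic_var` falls toward zero for the arm that keeps getting pulled. In the neural setting that clean closed form is gone, which is the burden the paragraph above describes.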

The most surprising result runs the other way: sometimes you can drop the bandit's signature exploration machinery and still win. Pure greedy exploitation matches UCB-style regret guarantees when the incoming context distribution is naturally diverse enough that users themselves supply the randomization (see "When can greedy bandits skip exploration entirely?"). That reframes the whole question: the 'reward assumptions' you most need aren't always about the reward's functional form, but about whether the world hands you enough variety to learn without deliberately probing it.
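A small simulation can show the mechanism, under assumptions chosen to favor it. In the sketch below, two arms are fit by ridge regression and the learner always plays whichever arm currently predicts the higher reward, with no exploration bonus at all. Contexts are drawn from a standard Gaussian, which is one illustrative way of making them diverse; the arm parameters, noise level, and function names are hypothetical.

```python
import random


def solve_2x2(A, b):
    """Solve A @ theta = b for a 2x2 system using the closed-form inverse."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [
        (A[1][1] * b[0] - A[0][1] * b[1]) / det,
        (A[0][0] * b[1] - A[1][0] * b[0]) / det,
    ]


def run_greedy(n_rounds=2000, seed=0):
    """Greedy two-arm linear bandit; returns (best-arm rate, per-arm estimates)."""
    rng = random.Random(seed)
    true_theta = [[1.0, -0.5], [-0.5, 1.0]]  # made-up ground truth per arm
    # Ridge statistics per arm: A = I + sum(x x^T), b = sum(reward * x).
    A = [[[1.0, 0.0], [0.0, 1.0]] for _ in true_theta]
    b = [[0.0, 0.0] for _ in true_theta]
    correct = 0
    for _ in range(n_rounds):
        # Diverse contexts: every direction of feature space keeps showing up.
        x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        est = [solve_2x2(A[a], b[a]) for a in range(2)]
        predicted = [est[a][0] * x[0] + est[a][1] * x[1] for a in range(2)]
        choice = 0 if predicted[0] >= predicted[1] else 1  # pure greedy, no bonus
        expected = [th[0] * x[0] + th[1] * x[1] for th in true_theta]
        reward = expected[choice] + rng.gauss(0.0, 0.1)
        for i in range(2):
            b[choice][i] += reward * x[i]
            for j in range(2):
                A[choice][i][j] += x[i] * x[j]
        correct += int(expected[choice] == max(expected))
    return correct / n_rounds, [solve_2x2(A[a], b[a]) for a in range(2)]
```

Because the contexts are symmetric around zero, each arm keeps being selected on roughly half the rounds and sees contexts spanning both feature directions, so neither estimate is starved of data. With a narrow or skewed context distribution that guarantee disappears, which is where deliberate exploration earns its keep.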

The through-line across the corpus is that 'scaling beyond' a bandit's assumptions almost always means swapping a stated assumption for an implicit one: linear reward becomes learned-basis reward, hand-built features become neural uncertainty estimates, active exploration becomes a bet on context diversity. If you want to see where bandit-style reward signals travel even further afield, the corpus also shows recommendation metrics being repurposed as black-box RL rewards for training language models directly (see "Can recommendation metrics train language models directly?"). It is the same instinct of treating a noisy, real-world signal as a learnable reward, carried into a very different system.


Sources (5 notes)

Can bandit algorithms beat collaborative filtering for news?

LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can neural networks explore efficiently at recommendation scale?

ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.

When can greedy bandits skip exploration entirely?

Contextual bandits using pure greedy exploitation can match UCB-style regret guarantees when the context distribution satisfies covariate diversity—a condition satisfied by many real continuous and discrete distributions where incoming users themselves provide sufficient randomization.

Can recommendation metrics train language models directly?

Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
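For reference, the two metrics named above are simple to compute, which is part of what makes them usable as rule-based rewards. The sketch below assumes binary relevance (an item is either relevant or not) and hypothetical function names; it shows only the metric computation, not how Rec-R1 wires the result into RL training.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k of the ranking."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 has discount log2(2)
        for rank, item in enumerate(ranked_ids[:k])
        if item in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Example: score one ranked list against a set of relevant item ids.
ranking = ["a", "b", "c"]
relevant = {"a", "c"}
reward = ndcg_at_k(ranking, relevant, k=3)
```

Both functions return a scalar between 0 and 1 for a single ranked list, so either can serve directly as a per-example reward.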

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a bandit-algorithm researcher evaluating whether linear contextual bandits have truly scaled beyond their original assumptions, or whether newer work has simply traded one constraint for another. The question remains: what's the fundamental limit?

What a curated library found — and when (dated claims, not current truth):
Findings span 2010–2026, with recent acceleration:
• LinUCB on news recommendation beats collaborative filtering with provable regret bounds, treating reward as linear in observed context (~2010).
• Pure greedy exploitation matches UCB regret when incoming context distribution is naturally diverse enough—exploration becomes optional if the world randomizes (~2017).
• Epistemic neural networks lift click-through 9%, ratings 6%, needing 29% fewer interactions by separating aleatoric noise from epistemic uncertainty in bandit exploration (~2023).
• Reward factorization learns base reward functions, then treats each user as a linear combination—linearity persists in learned feature basis (~2025).
• Test-time RL and latent reasoning depth suggest bandit-style reward signals now travel into LLM reasoning, constraint generation, and multi-step optimization (~2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:1003.0146 (2010) — foundational LinUCB for contextual bandits.
• arXiv:1704.09011 (2017) — exploration-free bandits under natural context diversity.
• arXiv:2306.14834 (2023) — scalable neural contextual bandits via epistemic uncertainty.
• arXiv:2503.06358 (2025) — reward factorization for LLM personalization.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 'linearity assumption,' has it truly dissolved in neural methods, or has the hard problem just shifted to: estimating neural model uncertainty accurately at scale, and accounting for distribution shift in learned reward bases? Separately, has 'naturally diverse context' become a more or less reliable substitute for deliberate exploration as models scale and recommendation systems become adversarial?
(2) Surface the strongest recent work that contradicts the claim that you always trade one assumption for another. Does anything genuinely drop a constraint without introducing a new one?
(3) Propose 2 durable research questions: (a) As reward signals migrate from recommendation into LLM reasoning (test-time RL, latent depth), do epistemic-bandit principles still hold, or do they collapse under the complexity of constraining reasoning? (b) Can meta-learning or online adaptation of uncertainty estimates reduce the 'assumption-swapping' tax?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines