INQUIRING LINE

How much context length can sequential recommenders handle before steering degrades?

This reads the question as asking where the limits are when you steer a sequential recommender with natural language — but the corpus is honest here: it has rich material on *how* steering works, and little on the *context-length-vs-degradation* curve specifically.


This explores how far you can push a sequential recommender with steering input before it stops behaving. The most direct match in the collection is Mender, which conditions a recommender on natural-language preferences pulled from reviews so users can redirect results at inference time without retraining (see "Can users steer recommendations with natural language at inference?"). The key insight there isn't about a token budget at all: steering works because preferences are treated as *runtime inputs* rather than baked-in training targets. That reframes your question: the bottleneck for steering isn't necessarily how long the context window is, but whether the model was built to read preferences as live instructions instead of historical patterns. Traditional recommenders fail the steering test not because they run out of room, but because they have no slot for a preference that wasn't in the training data.
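A toy sketch of what "preference as a runtime input" means in practice. This is not Mender's architecture: the frozen `history_scores` stand in for whatever a trained model learned, and `item_tags`, `preference_terms`, and `weight` are hypothetical names introduced only for illustration. The point is that the preference arrives as a call argument, so the ranking changes without touching the learned scores.

```python
from typing import Dict, List, Set


def rank(
    history_scores: Dict[str, float],
    item_tags: Dict[str, Set[str]],
    preference_terms: Set[str],
    weight: float = 1.0,
) -> List[str]:
    """Rank items by a frozen history score plus a runtime preference bonus.

    history_scores: stand-in for a trained model's output (never modified here).
    preference_terms: supplied per call; an empty set means "no steering".
    """
    def score(item: str) -> float:
        # Count how many of the item's tags match the stated preference.
        overlap = len(item_tags[item] & preference_terms)
        return history_scores[item] + weight * overlap

    return sorted(history_scores, key=score, reverse=True)


history_scores = {"thriller": 0.9, "comedy": 0.5, "documentary": 0.2}
item_tags = {
    "thriller": {"dark", "tense"},
    "comedy": {"light", "funny"},
    "documentary": {"calm", "factual"},
}

unsteered = rank(history_scores, item_tags, set())
steered = rank(history_scores, item_tags, {"light", "funny"})
```

With no preference the history order wins; passing `{"light", "funny"}` moves the comedy to the top while `history_scores` stays untouched. A recommender with no such argument has nowhere to put the preference, however long its context is.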

Where the corpus does touch length and scale, it does so from a different angle. The P5 line of work folds all user-item history and metadata into natural-language text fed through a single encoder-decoder, which means the recommender's 'context' is literally how much interaction text you can serialize before efficiency suffers. The note frames this explicitly as trading efficiency for composability (see "Can one text encoder unify all recommendation tasks?"). So one real answer to 'how much context' is structural: the more you express recommendation as a language task, the more the cost curves of language models (long-sequence efficiency) become your recommender's cost curves.
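To make the serialization point concrete, here is a minimal sketch of rendering an interaction history as text under a fixed budget. It is not P5's actual prompt template: the `serialize_history` helper, the sentence wording, and the whitespace word count used as a token estimate are all assumptions for illustration (a real system would count with the model's own tokenizer).

```python
from typing import List, Tuple


def serialize_history(
    interactions: List[Tuple[str, int]],
    max_tokens: int,
) -> str:
    """Render (item_title, rating) pairs as text, oldest first, within a budget.

    Token cost is approximated by whitespace word count (an assumption).
    When the budget is exceeded, the oldest interactions are dropped.
    """
    prefix = "User history:"
    used = len(prefix.split())
    kept: List[str] = []
    # Walk from most recent to oldest so truncation discards the oldest events.
    for title, rating in reversed(interactions):
        line = f"rated {title} {rating}/5;"
        cost = len(line.split())
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    kept.reverse()  # restore chronological order
    return " ".join([prefix] + kept)


history = [("Alien", 5), ("Heat", 4), ("Up", 3)]
short_prompt = serialize_history(history, max_tokens=8)
full_prompt = serialize_history(history, max_tokens=100)
```

With a budget of 8 the oldest interaction is dropped; with a generous budget all three survive. Every retained interaction costs prompt length, which is the sense in which the language model's long-sequence costs become the recommender's costs.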

There's a second, sneakier failure mode the collection surfaces: degradation from *thin* input rather than too much of it. ERRA shows that when a user's history is sparse, embedded methods lose signal, and the fix is retrieval augmentation plus aspect selection to keep explanations anchored to the actual user (see "Can retrieval enhancement fix explainable recommendations for sparse users?"). Read alongside Mender, this suggests steering can fail at the thin end, where too little user context starves it. The interesting open question, which this corpus doesn't resolve, is whether long histories eventually drown the steering signal in the same way.

For the catalog side of scale, RecLLM is worth a look: it catalogs four retrieval strategies for when the item corpus is too big to fit in any prompt (dual-encoder, direct LLM search, concept-based, and search-API lookup), each tuned to a different size and latency budget (see "How should LLM-based recommenders retrieve from massive item corpora?"). That's the corpus's honest answer to 'what happens when context can't hold everything': you stop trying to stuff it in and you retrieve instead. Rec-R1 pushes this further, showing a model can steer toward good recommendations without the catalog in context at all, learning catalog awareness indirectly through reward feedback (see "Can LLMs recommend products without ever seeing the catalog?").
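As an illustration of the first of those strategies, here is a minimal dual-encoder-style retrieval sketch: items and the query are represented as vectors, and only the top-k nearest items would be passed into the prompt. The vectors below are hand-written toy values, not the output of any real encoder, and `retrieve_top_k` is a hypothetical helper rather than RecLLM's implementation.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    item_vecs: Dict[str, Sequence[float]],
    k: int,
) -> List[str]:
    """Return the ids of the k items most similar to the query vector."""
    ranked = sorted(
        item_vecs.items(),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [item_id for item_id, _ in ranked[:k]]


# Toy embeddings (assumed values standing in for encoder outputs).
item_vecs = {
    "tent": [1.0, 0.0],
    "stove": [0.8, 0.6],
    "novel": [0.0, 1.0],
}
candidates = retrieve_top_k([1.0, 0.1], item_vecs, k=2)
```

Only the `k` retrieved ids need to enter the prompt, so context cost depends on `k` rather than on catalog size. A production system would replace the linear scan with an approximate nearest-neighbor index.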

The straight answer: this collection doesn't have a paper that measures a steering-degradation-vs-context-length curve directly. What it gives you instead is the more useful reframing: steering survives on *architecture* (treating preferences as runtime inputs) and *retrieval* (not relying on context to hold everything). Sparse signal is a documented way for it to break down; whether overload breaks it the same way is left untested here.


Sources (5 notes)

Can users steer recommendations with natural language at inference?

Mender conditions sequential recommenders on natural-language preferences extracted from reviews, enabling users to steer recommendations at inference without fine-tuning. This approach succeeds on preference-following tasks where traditional recommenders fail because preferences are runtime inputs, not training targets.

Can one text encoder unify all recommendation tasks?

P5 converts user-item interactions and metadata into natural language and trains a single encoder-decoder across five recommendation task families, matching task-specific models while achieving zero-shot transfer to new items and domains. Unification trades efficiency for composability.

Can retrieval enhancement fix explainable recommendations for sparse users?

ERRA combines model-agnostic review retrieval with personalized aspect selection to address data sparsity that embedded methods cannot solve. Retrieval augmentation provides richer signal when user history is sparse, while aspect personalization ensures explanations match user context rather than generic defaults.

How should LLM-based recommenders retrieve from massive item corpora?

RecLLM identifies four retrieval patterns—dual-encoder, direct LLM search, concept-based, and search-API lookup—each optimized for different corpus sizes, latency budgets, and training constraints. Hybrid approaches mixing multiple strategies likely work best for real systems.

Can LLMs recommend products without ever seeing the catalog?

Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a recommender-systems researcher evaluating whether steering degradation under long context remains a real bottleneck in 2025. A curated library from 2018–2025 explored this question and surfaced a reframing: the bottleneck isn't token budget per se, but architectural slot for runtime preferences, plus retrieval strategy for catalogs too large to fit in context.

What a curated library found — and when (dated claims, not current truth):
• Mender (2024-12) shows steering works via runtime preference inputs, not training-time targets, suggesting the real constraint is *architectural*, not context length.
• P5 (2022-03) trades efficiency for composability by serializing user-item history + metadata as text; long-sequence cost curves become the recommender's cost curves.
• ERRA (2023-06) shows sparse user histories starve embedding-based methods of signal; retrieval augmentation plus aspect selection compensates. Whether long histories degrade steering the same way was not measured.
• RecLLM (2023-10) catalogs four retrieval strategies for oversized item corpora; none rely on fitting the whole catalog in prompt context.
• Rec-R1 (2025-03) demonstrates models can steer toward good recommendations without the catalog in context at all, learning indirectly via RL feedback.

Anchor papers (verify; mind their dates):
• arXiv:2203.13366 (2022-03) — P5: Unified language-based recommendation.
• arXiv:2306.12657 (2023-06) — ERRA: Aspect-aware retrieval for sparse users.
• arXiv:2412.08604 (2024-12) — Mender: Runtime preference conditioning.
• arXiv:2503.24289 (2025-03) — Rec-R1: RL-driven steering without catalog context.

Your task:
(1) RE-TEST each constraint. For each finding, assess whether newer models (GPT-4o, Claude 3.5, Llama 3.3), training methods (RL fine-tuning, preference learning), tooling (prompt caching, vector stores, agentic retrieval), or evaluation have since relaxed the limits. Separate the durable question (likely: can we steer without seeing the full user history or item catalog?) from perishable limitations (possibly resolved by retrieval or preference-as-RL). State plainly where constraints still hold.
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months—papers showing steering *does* degrade sharply with context, or papers that sidestep the question entirely.
(3) Propose 2 research questions that *assume* the regime may have moved: e.g., do cached preference embeddings + multi-round steering outperform single-shot long-context calls? Can Rec-R1 scale to billion-item catalogs without degradation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
