INQUIRING LINE

Can encoder-only architectures match decoder-based sequential models for recommendation?

This explores whether bidirectional encoder-only transformers (which see a whole sequence at once) can hold their own against the autoregressive, decoder-style models that predict the next item one step at a time — the dominant paradigm in sequential recommendation.


Can bidirectional encoder-only transformers match the autoregressive, next-item-prediction models that dominate sequential recommendation? The corpus says yes, with a twist. The most direct evidence is Sequential Masked Modeling, which adapts encoder-only transformers to session-based recommendation using "penultimate-token masking" (masking the second-to-last item so the encoder learns to predict what comes next) plus sliding-window augmentation. Across three datasets it consistently outperforms other single-session methods, and, despite seeing only one session, it rivals cross-session recommenders that have access to far richer user history (Can single sessions alone rival history-rich recommendation?). The architectural choice (encoder vs. decoder) turns out to matter less than how you frame the prediction task.
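To make the training setup concrete, here is a minimal sketch of how penultimate-token masking and sliding-window augmentation could turn one session into several training examples. It is one reading of the description above, not the paper's implementation: the `MASK` token, the function names, and the choice to take every contiguous window are all assumptions made for illustration.

```python
MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def sliding_windows(session, window_size):
    """Return every contiguous window of `window_size` items.

    A session shorter than the window is returned whole.
    """
    if len(session) <= window_size:
        return [list(session)]
    last_start = len(session) - window_size
    return [list(session[i:i + window_size]) for i in range(last_start + 1)]


def mask_penultimate(window):
    """Hide the second-to-last item; return (masked_input, hidden_item)."""
    if len(window) < 2:
        raise ValueError("need at least two items to mask the penultimate one")
    masked = list(window)
    target = masked[-2]
    masked[-2] = MASK
    return masked, target


def training_examples(session, window_size):
    """Build one (masked_input, target) pair per usable window."""
    windows = sliding_windows(session, window_size)
    return [mask_penultimate(w) for w in windows if len(w) >= 2]


# Made-up session of item IDs, in click order.
session = ["shoes", "socks", "laces", "polish", "brush"]
for masked, target in training_examples(session, window_size=4):
    print(masked, "->", target)
```

Each window gives the encoder context on both sides of the hidden item, which is what a bidirectional model can exploit and a strictly left-to-right decoder cannot.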

That lesson, that structure and framing beat raw architectural muscle, echoes loudly across the rest of the corpus. The most pointed example is the EASE/ESLER line of work, where a shallow *linear* item-item model with one trick (forcing the diagonal to zero so an item can't predict itself) beats deep neural autoencoders on most datasets (Can simpler models beat deep networks for recommendation systems?; Can a linear model beat deep collaborative filtering?). In both EASE and ESLER the win comes from a structural prior, a constraint that forces the model to learn relationships through other items, rather than from more parameters or a fancier sequence model. Encoder-only recommendation succeeds for the same reason: the masking objective is a structural prior that makes a non-autoregressive model behave as if it understood sequence.
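The zero-diagonal trick is small enough to sketch in a few lines. The block below uses the closed-form ridge solution associated with EASE (arXiv:1905.03375) as I understand it: invert the regularized item-item Gram matrix, rescale each column, and zero the diagonal. The regularization strength `l2`, the function name, and the toy interaction matrix are assumptions for illustration, not values from the source.

```python
import numpy as np


def fit_ease(X, l2=1.0):
    """Ridge-regularized item-item weights B with diag(B) = 0."""
    gram = X.T @ X + l2 * np.eye(X.shape[1])  # regularized item-item Gram matrix
    P = np.linalg.inv(gram)
    B = -P / np.diag(P)        # broadcasting scales column j by 1 / P[j, j]
    np.fill_diagonal(B, 0.0)   # an item may not predict itself
    return B


# Toy interaction matrix: rows are users, columns are items (made-up data).
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
], dtype=float)

B = fit_ease(X, l2=0.5)
scores = X @ B  # predicted affinity of every user for every item
```

Because the diagonal is pinned to zero, every score for an item has to be assembled from the user's other items, which is the structural prior the paragraph above describes. Nothing stops the off-diagonal weights from going negative, which is how such a model can express dissimilarity between items.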

There's a second framing worth pulling in: you don't have to pick "encoder vs. decoder" at all. P5 recasts recommendation as text-to-text and trains a single encoder-decoder across five task families, matching task-specific models and transferring zero-shot to new items and domains (Can one text encoder unify all recommendation tasks?). So the encoder/decoder distinction partly dissolves once you treat recommendation as language modeling: the interesting question becomes what representation the model consumes, not which half of the transformer you keep.

That shifts attention to representation choices that cut across architecture entirely. VQ-Rec maps item text to discrete codes before embedding, which decouples the recommender from any particular text encoder and transfers across domains better than raw text embeddings (Can discretizing text embeddings improve recommendation transfer?; Can discrete codes transfer better than text embeddings?). Meanwhile AMP-CF shows that *where* you spend modeling capacity (splitting a user into multiple attention-weighted personas conditioned on the candidate item) buys both accuracy and built-in explanations (Can attention mechanisms reveal which user taste explains each recommendation?; Can modeling multiple user personas improve recommendation accuracy?).
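The "text to discrete codes to embeddings" path can be sketched with plain product quantization. This is a rough illustration under stated assumptions, not VQ-Rec's actual pipeline: the codebooks and lookup tables below are random stand-ins for fitted centroids and trained embeddings, the sizes are arbitrary, and summing the per-group code embeddings is one plausible pooling choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: text-vector dim, sub-vector groups, codes per group, output dim.
DIM, GROUPS, CODES, EMB = 8, 4, 16, 6
SUB = DIM // GROUPS

# Random stand-ins; a real system would fit centroids and train the tables.
codebooks = rng.normal(size=(GROUPS, CODES, SUB))
code_emb = rng.normal(size=(GROUPS, CODES, EMB))


def quantize(text_vec):
    """Map a text embedding to one nearest-centroid code index per group."""
    parts = text_vec.reshape(GROUPS, SUB)
    dists = ((codebooks - parts[:, None, :]) ** 2).sum(axis=-1)  # (GROUPS, CODES)
    return dists.argmin(axis=1)


def item_embedding(codes):
    """Look up one embedding per group and pool them by summation."""
    return code_emb[np.arange(GROUPS), codes].sum(axis=0)


text_vec = rng.normal(size=DIM)   # stand-in for a text encoder's output
codes = quantize(text_vec)
item_vec = item_embedding(codes)
```

The recommender only ever sees `codes`, so the lookup tables in `code_emb` can be re-learned for a new domain while the text encoder and codebooks stay fixed.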

The thing you didn't know you wanted to know: the field keeps rediscovering that the recommendation "backbone" is surprisingly fungible. Encoder-only can match decoder-only, a linear model can match a deep one, and a text-to-text model can absorb both — because in recommendation the leverage lives in the prior you bake in (a masking objective, a zero-diagonal constraint, candidate-conditioned personas) far more than in the sequence-modeling machinery you happen to choose.


Sources (8 notes)

Can single sessions alone rival history-rich recommendation?

Sequential Masked Modeling adapts encoder-only transformers for session-based recommendation using penultimate-token masking and sliding-window augmentation. Across three datasets, this single-session approach consistently outperforms other single-session methods and rivals cross-session approaches with richer user history.

Can simpler models beat deep networks for recommendation systems?

EASE, a shallow linear item-item weight matrix with diagonal constrained to zero, beats deep neural baselines on most datasets. The constraint forces generalization by forbidding self-prediction, while learned negative weights capture item dissimilarity—a structural prior more valuable than model capacity.

Can a linear model beat deep collaborative filtering?

ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.

Can one text encoder unify all recommendation tasks?

P5 converts user-item interactions and metadata into natural language and trains a single encoder-decoder across five recommendation task families, matching task-specific models while achieving zero-shot transfer to new items and domains. Unification trades efficiency for composability.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Can attention mechanisms reveal which user taste explains each recommendation?

AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.

Can modeling multiple user personas improve recommendation accuracy?

AMP-CF separates user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.
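As a rough illustration of candidate-conditioned personas, the sketch below scores one item against a user who is stored as several persona vectors. It assumes simple dot-product attention with a softmax, which may differ from AMP-CF's actual scoring function; the function name and the toy vectors are hypothetical.

```python
import numpy as np


def persona_score(personas, item):
    """Score `item` for a user represented by several persona vectors.

    Returns the score and the attention weights over personas; the weights
    double as an explanation of which taste drove the recommendation.
    """
    logits = personas @ item                 # affinity of each persona to the item
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    user_vec = weights @ personas            # user representation for this candidate
    return float(user_vec @ item), weights


# Two made-up personas, each aligned with one latent taste dimension.
personas = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
item_a = np.array([2.0, 0.0])   # item matching the first persona
item_b = np.array([0.0, 2.0])   # item matching the second persona

score_a, weights_a = persona_score(personas, item_a)
score_b, weights_b = persona_score(personas, item_b)
```

The user representation is rebuilt for every candidate, so the same user looks different to `item_a` and `item_b`, and the dominant weight identifies the persona behind each score.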

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a recommendation systems researcher. The question remains open: can encoder-only architectures match decoder-based sequential models for recommendation, and does architectural choice actually matter?

What a curated library found — and when (dated claims, not current truth):
Findings span 2018–2025. A curated library identified:
- Sequential Masked Modeling (penultimate-token masking) lets encoder-only transformers outperform other single-session methods and rival cross-session recommenders while seeing only one session (~2024, arXiv:2410.11150).
- Shallow linear item-item models (EASE, diagonal constraint) beat deep autoencoders on sparse collaborative filtering, showing structure over parameters matters (~2019, arXiv:1905.03375).
- Encoder-decoder unification (P5: text-to-text across five tasks) matches task-specific models and transfers zero-shot (~2022, arXiv:2203.13366).
- Discrete code representations (VQ-Rec) decouple item text from the recommender and transfer across domains better than raw embeddings (~2022, arXiv:2210.12316).
- Multi-persona attention (AMP-CF) allocates capacity per candidate item, improving both accuracy and interpretability (~2020, arXiv:2010.07042).

Anchor papers (verify; mind their dates):
- arXiv:1905.03375 (EASE, 2019)
- arXiv:2203.13366 (P5/RLP, 2022)
- arXiv:2210.12316 (VQ-Rec, 2022)
- arXiv:2410.11150 (encoder-only session rec, 2024)

Your task:
(1) RE-TEST EACH CONSTRAINT. For penultimate-token masking, discrete codes, and multi-persona allocation: have newer LLM-integrated systems (e.g., arXiv:2503.24289 Rec-R1, 2025) or foundation models with longer context windows, better inference optimization, or retrieval-augmented orchestration since relaxed or overturned the architectural parity claim? Separate the durable insight (priors matter more than architecture) from perishable limitations (single-session scope, sparse data regimes). Cite what relaxed it.
(2) Surface the strongest contradicting or superseding work from the last 6 months — papers claiming encoder-only *cannot* match decoders, or showing new architectural constraints emerged.
(3) Propose 2 research questions that assume the regime has moved: (a) do foundation-model-scale recommenders (LLM-based or hybrid) dissolve the encoder/decoder distinction further, and if so, what is the new bottleneck? (b) in multi-task, multi-domain recommendation with LLM integration, does discrete or continuous representation dominate?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
