Does bidirectional attention improve language models as universal encoders?
This explores whether removing the left-to-right "causal mask" — so the model can look both forward and backward across a sentence — makes decoder-only LLMs better at producing the embeddings used for search, retrieval, and clustering.
The corpus has a direct and surprisingly clean answer: yes. The LLM2Vec work (see "Why do decoder-only models underperform as text encoders?") shows that what held decoder-only models back as text encoders was never their size; it was causal attention itself. Because a decoder-only model only ever sees tokens to its left, the representation of an early word never "knows" about the words that follow it. That is the opposite of what you want in an encoder, where a good vector should summarize the whole passage. Switch on bidirectional attention, add a short round of masked prediction and contrastive learning, and these models reach state-of-the-art results on the standard embedding benchmark (MTEB). The bottleneck was architectural, not a matter of scale.
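The effect of the causal mask on an early token can be seen in a toy example. The following is a minimal sketch, not LLM2Vec's code: it assumes a single attention head with identity projections, and the names `attend`, `sent_a`, and `sent_b` are hypothetical. Two inputs share a first token but differ afterwards, and the sketch compares the first token's output with and without the mask.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, causal):
    """Toy single-head self-attention with identity Q/K/V projections (an assumption)."""
    dim = len(vectors[0])
    outputs = []
    for i, query in enumerate(vectors):
        # Under a causal mask, position i may only attend to positions 0..i.
        visible = vectors[: i + 1] if causal else vectors
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)]
        )
    return outputs

# Two toy "sentences": same first token, different continuations.
sent_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sent_b = [[1.0, 0.0], [1.0, -1.0], [-1.0, 0.5]]

causal_a = attend(sent_a, causal=True)[0]
causal_b = attend(sent_b, causal=True)[0]
bidir_a = attend(sent_a, causal=False)[0]
bidir_b = attend(sent_b, causal=False)[0]

print(causal_a == causal_b)  # True: the first token cannot see what follows
print(bidir_a != bidir_b)    # True: without the mask, later tokens change it
```

In a real model the same idea applies per layer and per head; the sketch only shows that the mask, not the model's size, decides whether an early position can reflect later context.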
What makes this interesting is *why* causal attention is such a liability, and here the corpus lets you go sideways into the mechanics of attention. One line of work argues that transformers don't store knowledge as retrievable records at all; knowledge lives as a continuous flow of activations that exists only in the act of generation (see "Do transformer models store knowledge or generate it continuously?"). A model built to *generate* the next token left-to-right is optimized for that act of generation, not for compressing meaning into a single fixed vector, which is precisely the job of an encoder. Bidirectional attention is a way of repurposing a generation engine into a summarization engine.
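"Compressing meaning into a fixed point" usually means pooling per-token states into one vector. The sketch below is illustrative only: `mean_pool` and `last_token_pool` are hypothetical helper names, the hidden states are made-up numbers, and which pooling a given encoder uses is an assumption here rather than something the notes specify.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the hidden states of real (non-padding) tokens into one vector."""
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dim)]

def last_token_pool(token_vectors, attention_mask):
    """Use the final real token's state, the natural choice under causal attention."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(token_vectors[last])

# Made-up hidden states for three real tokens plus one padding position.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
mask = [1, 1, 1, 0]

print(mean_pool(hidden, mask))        # [3.0, 4.0]
print(last_token_pool(hidden, mask))  # [5.0, 6.0]
```

Under causal attention only the last position has seen the whole passage, so averaging mixes in states that never saw what came after them. Once attention is bidirectional, every position can reflect the full input, which is one reason averaging becomes a more sensible summary.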
There's a cautionary thread too. Soft attention has a structural bias: it systematically over-weights repeated and context-prominent tokens regardless of whether they're relevant (see "Does transformer attention architecture inherently favor repeated content?"). Letting attention see in both directions doesn't automatically fix that bias; it can give prominent-but-irrelevant material more chances to dominate the representation. So "bidirectional" is a genuine improvement for encoding, but it inherits attention's other quirks rather than curing them.
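One simple mechanism behind this bias is softmax normalization: every copy of a token claims its own share of the attention budget. The toy sketch below illustrates that single mechanism under the assumption that all keys have the same raw score (are equally "relevant"). It is not the analysis from the cited note, and `mass_on_repeats` is a hypothetical name.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mass_on_repeats(n_copies, n_others=3, score=1.0):
    """Total attention weight on n_copies of one token when every key,
    repeated or not, has the same raw score."""
    weights = softmax([score] * (n_copies + n_others))
    return sum(weights[:n_copies])

# The repeated token's share grows as n / (n + 3): 0.25, 0.5, 0.75.
for n in (1, 3, 9):
    print(n, round(mass_on_repeats(n), 3))
```

Nothing about the repeated token became more relevant between runs; it simply appears more often. Removing the causal mask widens the set of positions that can contribute copies, which is why the bias carries over to the bidirectional setting.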
The broader lesson worth carrying away: the field keeps finding that decoder-only LLMs have latent capabilities limited by how they were built and trained rather than by how big they are. Depth turns out to matter more than width for small models (see "Does depth matter more than width for tiny language models?"); unused sequence space after the end-of-text token can be repurposed to teach self-evaluation (see "Can models learn to evaluate their own work during training?"); and here, flipping the attention mask unlocks an entirely different use. "Universal encoder" is less a new model you have to train from scratch and more a setting you can switch on, with a short adaptation step, in one you already have.
Sources (5 notes)
LLM2Vec's unsupervised 3-step process (bidirectional attention + masked prediction + contrastive learning) achieves SOTA on MTEB. The research shows causal masking, not model size, is the representation bottleneck in decoder-only encoders.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.