Can decoder-only models become effective text encoders with training?
This explores whether models built to generate text left-to-right (decoder-only LLMs like the GPT family) can be retrofitted into good text encoders — systems that produce dense vector representations for search and similarity — and what kind of training it takes.
The corpus gives a clear yes, along with a sharp diagnosis of *why* these models start out bad. The most direct answer comes from work on LLM2Vec (Why do decoder-only models underperform as text encoders?): the thing holding these models back as encoders isn't their size or their pretraining, it's the *causal mask*. Because each token in a decoder-only model may only attend leftward, only the final position ever 'sees' the full sentence, and every earlier token's representation is computed without its right-hand context, which is a poor basis for a whole-text embedding. Flip on bidirectional attention, add a short bout of masked next-token prediction and unsupervised contrastive learning, and the same model reaches unsupervised state-of-the-art results on MTEB, the standard embedding benchmark. The surprising takeaway: a limitation everyone attributed to the weights was actually imposed by the attention pattern.
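To make the masking argument concrete, here is a minimal sketch in plain Python of how much of a sequence each position can attend to under a causal mask versus a bidirectional one. The helper names (`attention_mask`, `visible_fraction`) and the example sentence are illustrative assumptions, not part of LLM2Vec or any library, and no real attention is computed.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(not causal) or j <= i for j in range(n_tokens)]
            for i in range(n_tokens)]


def visible_fraction(mask):
    """For each position, the fraction of the sequence it can attend to."""
    n = len(mask)
    return [sum(row) / n for row in mask]


# Hypothetical four-token input, chosen only for illustration.
tokens = ["the", "bank", "was", "steep"]
causal_view = visible_fraction(attention_mask(len(tokens), causal=True))
bidirectional_view = visible_fraction(attention_mask(len(tokens), causal=False))

for tok, c, b in zip(tokens, causal_view, bidirectional_view):
    print(f"{tok:>6}  causal={c:.2f}  bidirectional={b:.2f}")
```

Under the causal mask, 'bank' is encoded before 'steep' is visible, so a pooled embedding averages token representations that mostly never saw the disambiguating word. With the bidirectional mask every position covers the whole input.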
That reframes 'with training' as less about teaching the model new knowledge and more about *unlocking representations it already had*. There's supporting evidence that the raw material is sitting there before attention even runs: analysis of static embeddings shows they already encode rich semantic structure (valence, concreteness, even taboo), functioning like genuine lexical entries (Do transformer static embeddings actually encode semantic meaning?). That analysis was run on RoBERTa, an encoder model, so for decoder-only models it is suggestive rather than direct evidence. Still, it fits the picture that encoder conversion isn't building meaning from scratch; it's reorganizing access to signal the network already carries.
The 'with training' part also comes with a caution the corpus surfaces from a different corner. Not all fine-tuning is benign: directly tuning a model's weights can corrupt knowledge stored in its lower layers, which is why decoding-time methods like proxy-tuning preserve pretrained knowledge better (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). The lesson for encoder conversion is that *how* you train matters: lightweight, targeted adaptation (as LLM2Vec uses) is more likely to preserve the model's strengths than heavy weight surgery. This connects to how knowledge lives in these models at all: transformer residual streams seem to carry knowledge as continuous *flow* rather than fixed storage (Do transformer models store knowledge or generate it continuously?), which helps explain both why representations are extractable and why aggressive retraining can disturb them.
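As a rough illustration of what 'decoding-time' tuning means, the sketch below shifts a base model's next-token logits by the difference between a small tuned model and its untuned counterpart, leaving the base weights untouched. This follows the logit arithmetic of proxy-tuning as I understand it; the function names and all logit values are made-up assumptions for a three-token vocabulary, not outputs of any real model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base logits by (tuned small model - untuned small model)."""
    shifted = [b + (t - u) for b, t, u in
               zip(base_logits, tuned_small_logits, untuned_small_logits)]
    return softmax(shifted)


# Hypothetical logits over a three-token vocabulary.
base = [2.0, 1.0, 0.0]           # large base model, weights never updated
tuned_small = [0.0, 2.0, 0.0]    # small model after fine-tuning
untuned_small = [0.0, 0.0, 0.0]  # the same small model before fine-tuning

base_probs = softmax(base)
proxy_probs = proxy_tuned_distribution(base, tuned_small, untuned_small)
print("base choice:", base_probs.index(max(base_probs)))
print("proxy-tuned choice:", proxy_probs.index(max(proxy_probs)))
```

The point of the design is that only the output distribution moves. When the small tuned and untuned models agree, the shift is zero and the base model's prediction passes through unchanged.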
Worth knowing before you get too optimistic: a better encoder is still a *text* encoder, and text has a ceiling. One thread argues text-only models are 'Plato's cave' systems, since language strips out the physics, geometry, and causality of the world it describes (Are text-only language models fundamentally limited by abstraction?), and a related argument holds that form-only training can't recover grounded meaning at all (Can language models learn meaning from text patterns alone?). Converting a decoder into an encoder makes the representations more useful, but it doesn't escape those limits; it just gives you cleaner access to whatever the text already contained.
So the honest synthesis is: yes, decoder-only models become effective encoders with surprisingly little training, because the barrier was architectural (causal masking) not representational — but the gains are about *unlocking and reorganizing* existing signal, the training method has to be gentle enough not to corrupt it, and the resulting encoder inherits the same grounding limits as any text-only system.
Sources (6 notes)
LLM2Vec's unsupervised 3-step process (bidirectional attention + masked next-token prediction + contrastive learning) achieves unsupervised SOTA on MTEB. The research shows causal masking, not model size, is the representation bottleneck when decoder-only models are used as encoders.
Clustering analysis of RoBERTa embeddings reveals sensitivity to five psycholinguistic measures including valence, concreteness, iconicity, and taboo. This demonstrates that static embeddings function as genuine lexical entries containing semantic content before self-attention operates.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
Text strips the physics, geometry, and causality present in reality, forcing language models to manipulate symbols without grounding in their source dynamics. This creates predictable failure modes in physical, geometric, and causal reasoning that multimodal training could address.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.