Can editing hidden representations beat weight updates for finetuning?
Does intervening directly on a frozen model's representations offer a better path to parameter-efficient adaptation than current weight-based methods? The question challenges the dominant PEFT paradigm by treating representations, rather than weights, as the semantic lever.
Parameter-efficient finetuning (PEFT) adapts large models by updating a small number of weights (LoRA and variants). ReFT starts from a different premise drawn from interpretability: representations encode rich semantic information, so editing representations might be more powerful than editing weights. ReFT methods keep the base model frozen and learn task-specific interventions on its hidden representations. A strong instance of the family, LoReFT (Low-rank Linear Subspace ReFT), is a drop-in replacement for existing PEFTs. It is reported to be 10–50× more parameter-efficient than prior state-of-the-art PEFTs and to almost always outperform them across eight commonsense-reasoning tasks, four arithmetic-reasoning tasks, instruction following (Alpaca-Eval), and GLUE.
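To make "low-rank linear subspace" concrete, here is a minimal NumPy sketch of a LoReFT-style edit. It assumes the intervention has the form Φ(h) = h + Rᵀ(Wh + b − Rh), with R an r×d matrix with orthonormal rows, W an r×d matrix, and b an r-vector. The sizes `d` and `r`, the random values, and the stand-in hidden states are all hypothetical; in an actual run R, W, and b would be learned while the base model stays frozen.

```python
import numpy as np

def loreft(h, R, W, b):
    """Apply a LoReFT-style edit to hidden states h of shape (n, d).

    R: (r, d) with orthonormal rows; W: (r, d); b: (r,).
    Computes h + R^T (W h + b - R h) for each hidden vector.
    """
    target = h @ W.T + b     # (n, r): values the subspace coordinates should take
    current = h @ R.T        # (n, r): current coordinates in the subspace
    return h + (target - current) @ R  # write the difference back along R's rows

rng = np.random.default_rng(0)
d, r = 16, 4                 # hypothetical hidden size and intervention rank

# Build R with orthonormal rows from the QR factorization of a random (d, r) matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T                      # (r, d), so R @ R.T is the r x r identity
W = 0.1 * rng.standard_normal((r, d))
b = np.zeros(r)

h = rng.standard_normal((3, d))   # stand-in for three hidden states of a frozen model
h_new = loreft(h, R, W, b)

# Inside the subspace, the coordinates now equal W h + b.
in_subspace_ok = np.allclose(h_new @ R.T, h @ W.T + b)

# The component orthogonal to the subspace is left untouched.
P = R.T @ R                  # projector onto the row space of R
outside_unchanged = np.allclose(h_new - h_new @ P, h - h @ P)

# Trainable parameters for this single intervention: R, W, and b.
n_params = R.size + W.size + b.size
```

The two checks show what "intervening in a subspace" means here: only the r coordinates picked out by R are rewritten, and everything orthogonal to them passes through unchanged. One intervention costs 2rd + r parameters; how that compares with LoRA in practice depends on how many layers and token positions are intervened on, which this sketch does not model.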
The keeper is the conceptual bridge: interpretability findings (that meaning lives in representations as directions/subspaces) become an adaptation method — intervene in the representation subspace rather than perturb weights. This unifies steering and finetuning: the same handle used to interpret a model can be used to adapt it.
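One way to see the steering/finetuning link is that adding a fixed steering vector is a special case of a rank-1 subspace edit. The sketch below assumes the LoReFT-style form h + Rᵀ(Wh + b − Rh); the direction `v`, the strength `alpha`, and the hidden states are hypothetical stand-ins, not values from any model.

```python
import numpy as np

def loreft(h, R, W, b):
    # h + R^T (W h + b - R h), applied row-wise to h of shape (n, d)
    return h + (h @ W.T + b - h @ R.T) @ R

rng = np.random.default_rng(1)
d = 8                          # hypothetical hidden size

v = rng.standard_normal(d)
v /= np.linalg.norm(v)         # unit-norm "feature direction" (hypothetical)
alpha = 2.5                    # steering strength (hypothetical)

h = rng.standard_normal((5, d))

# Activation steering: add a fixed multiple of the direction to every hidden state.
steered = h + alpha * v

# The same edit written as a rank-1 intervention: R = v^T, W = R, b = alpha.
# With W = R the input-dependent terms cancel, leaving h + alpha * v.
R = v[None, :]
edited = loreft(h, R, W=R.copy(), b=np.array([alpha]))

same_edit = np.allclose(steered, edited)
```

Letting W differ from R, and raising the rank above 1, turns the constant shift into an input-dependent edit of the same subspace, which is the step from steering toward finetuning.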
This connects the vault's PEFT and mechinterp threads. It operationalizes the linear-representation premise behind "Can dictionary learning scale to production language models?" (features as steerable directions) as a finetuning technique, and it rhymes with "Does reinforcement learning update only a small fraction of parameters?": adaptation concentrates in a low-dimensional subspace, whether of weights or representations.
Inquiring lines that use this note as a source (15)
This note is a source for these synthesized inquiries.
- Why does the right structural prior matter more than raw model capacity?
- Why do unified models still inherit data-distribution biases from training?
- What causes overfitting when forcing new facts into model weights?
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
- Does pretraining data size matter less than base model scale for finetuning?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Can we unlearn memorized text by finetuning only high-gradient weights?
- How do newly learned facts become accessible after gradient updates?
- How do semantic features in representations become steerable task-specific directions?
- Why does adaptation concentrate in low-dimensional subspaces of weights or representations?
- What makes representation interventions more efficient than weight perturbations for finetuning?
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- Do scaling laws change when weight precision becomes a design variable?
- Does finetuning facts into weights overwrite existing model capabilities?
Related concepts in this collection (3)
- Can dictionary learning scale to production language models?
  Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
  ReFT turns the interpret-via-directions premise into an adaptation method
- Does reinforcement learning update only a small fraction of parameters?
  Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
  both find adaptation lives in a low-dimensional subspace
- Can we trigger reasoning without explicit chain-of-thought prompts?
  This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
  representation intervention as a capability lever, here generalized to task finetuning
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- ReFT: Representation Finetuning for Language Models
- Tina: Tiny Reasoning Models via LoRA
- Context-PEFT: Efficient Multi-Modal, Multi-Task Fine-Tuning
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- The Unreasonable Ineffectiveness of the Deeper Layers
- Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
Original note title
representation finetuning intervenes on frozen hidden representations instead of weights and is far more parameter-efficient than LoRA