SYNTHESIS NOTE

Can editing hidden representations beat weight updates for finetuning?

Does intervening directly on a frozen model's representations offer a better path to parameter-efficient adaptation than current weight-based methods? This challenges the dominant PEFT paradigm by treating representations as the semantic lever instead.

Synthesis note · 2026-06-03 · sourced from Training Fine Tuning

Parameter-efficient finetuning (PEFT) adapts large models by updating a small number of weights, as in LoRA and its variants. Representation finetuning (ReFT) starts from a different premise drawn from interpretability: representations encode rich semantic information, so editing representations might be more powerful than editing weights. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. A strong instance of the family, LoReFT (low-rank linear subspace ReFT), is a drop-in replacement for existing PEFTs. It is reported to be 10–50× more parameter-efficient than prior state-of-the-art PEFTs and to almost always outperform them across eight commonsense-reasoning tasks, four arithmetic-reasoning tasks, instruction following (Alpaca-Eval), and GLUE.
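A minimal sketch of what "learning an intervention on a hidden representation" can look like, in plain Python. It assumes the LoReFT edit takes the form h + Rᵀ(Wh + b − Rh), with R a low-rank matrix with orthonormal rows; that formula is recalled from the ReFT paper and is not stated in this note. The helper names (`orthonormal_rows`, `loreft_edit`), the toy sizes, and the random values standing in for trained parameters are all illustrative assumptions. No training loop is shown: in practice R, W, and b would be learned while the base model stays frozen.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def orthonormal_rows(r, d, rng):
    """Build r orthonormal vectors in R^d via Gram-Schmidt on random rows."""
    rows = []
    while len(rows) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def loreft_edit(h, R, W, b):
    """Assumed LoReFT form: h + R^T (W h + b - R h).

    Only the coordinates of h inside the row space of R are rewritten.
    """
    current = matvec(R, h)                                    # R h, length r
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]     # W h + b
    delta = [t - c for t, c in zip(target, current)]
    # Map the r-dimensional correction back to R^d with R^T.
    return [hj + sum(R[i][j] * delta[i] for i in range(len(R)))
            for j, hj in enumerate(h)]


# Toy sizes (hypothetical): hidden width d = 8, intervention rank r = 2.
rng = random.Random(0)
d, r = 8, 2
R = orthonormal_rows(r, d, rng)                               # would be learned
W = [[rng.gauss(0.0, 0.1) for _ in range(d)] for _ in range(r)]  # would be learned
b = [0.0] * r                                                 # would be learned
h = [rng.gauss(0.0, 1.0) for _ in range(d)]                   # frozen model's hidden state

h_new = loreft_edit(h, R, W, b)

# Trainable parameters for one such intervention: R and W (r*d each) plus b (r).
n_params = 2 * r * d + r
```

Under these assumptions the trainable parameter count per intervention scales with r·d rather than with the size of any weight matrix, which is one way to read the efficiency claim above; the note itself does not give the breakdown.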

The keeper is the conceptual bridge: interpretability findings (that meaning lives in representations as directions/subspaces) become an adaptation method — intervene in the representation subspace rather than perturb weights. This unifies steering and finetuning: the same handle used to interpret a model can be used to adapt it.
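One way to make "intervene in the representation subspace" concrete is a short derivation. It assumes the LoReFT edit as recalled from the ReFT paper (the formula is not given in this note): h is a hidden state of width d, R is an r × d matrix with orthonormal rows, and W (r × d) and b (length r) are learned.

```latex
\Phi(h) = h + R^{\top}\,(W h + b - R h), \qquad R R^{\top} = I_r

% Inside the subspace spanned by the rows of R, the edited
% coordinates are replaced by the learned target W h + b:
R\,\Phi(h) = R h + R R^{\top}\,(W h + b - R h) = W h + b

% Along any direction u orthogonal to the subspace (R u = 0),
% the representation is left untouched:
u^{\top}\Phi(h) = u^{\top} h + (R u)^{\top}(W h + b - R h) = u^{\top} h
```

Read this way, the edit rewrites r coordinates of the representation and leaves the remaining d − r alone, which is the sense in which a direction or subspace used for interpretation can double as the handle for adaptation.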

This connects the vault's PEFT and mechanistic-interpretability threads. It operationalizes the linear-representation premise behind "Can dictionary learning scale to production language models?" (features as steerable directions) as a finetuning technique, and it rhymes with "Does reinforcement learning update only a small fraction of parameters?": in both, adaptation concentrates in a low-dimensional subspace, whether of weights or of representations.

Inquiring lines that read this note 25

This note is a source for the research framings below, each listed by the broader line of inquiry it explores.

When does architectural design matter more than raw model capacity?

What are the consequences of models training on synthetic data?

Why do unified models still inherit data-distribution biases from training?

Why does finetuning cause catastrophic forgetting of model capabilities?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Do language model representations contain causally steerable task-specific features?

How do semantic features in representations become steerable task-specific directions?

Do autonomous architecture discoveries follow predictable scaling laws?

Do scaling laws change when weight precision becomes a design variable?

How can identical external performance mask different internal representations?

Why do internal representations differ when external performance matches?

Do language models learn genuine linguistic structure or just surface patterns?

Can we balance interpretability with the efficiency gains of compressed inter-model communication?

How can AI agents autonomously learn and transfer skills across tasks?

Do weight-space skills lose detail compared to textual skill descriptions?

Do harness improvements transfer across model scales or memorize shortcuts?

What cognitive burdens should move from model parameters into harness infrastructure?

Which computational strategies best support reasoning in language models?

Can a trained decoder replace both search and parameter updates?

How does example difficulty affect learning efficiency in language models?

Can learned priors effectively select and weight ensemble members by inference budget?

Why do semantic similarity and task relevance diverge in vector embeddings?

How does representation-level reranking address residual gaps after decomposition?

Related concepts in this collection 3


Can dictionary learning scale to production language models? Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.
ReFT turns the interpret-via-directions premise into an adaptation method

Does reinforcement learning update only a small fraction of parameters? Investigating whether RL algorithms consistently modify only 5–30% of model parameters across different LLMs and RL methods, and what structural properties those sparse updates possess.
both find adaptation lives in a low-dimensional subspace

Can we trigger reasoning without explicit chain-of-thought prompts? This research asks whether models possess latent reasoning capabilities that can be activated through direct feature steering, independent of chain-of-thought instructions. Understanding this matters for making reasoning more efficient and controllable.
representation intervention as a capability lever, here generalized to task finetuning
