SYNTHESIS NOTE

Can we train better models on less data?

Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

LESS (Low-rank gradiEnt Similarity Search) selects instruction tuning data by estimating each example's influence on a target capability. Given a handful of examples embodying a specific skill (e.g., reasoning), LESS constructs a gradient datastore of low-dimensional features and selects training data whose gradient signatures are most similar to the target examples.
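The selection step described above can be sketched as a similarity lookup over precomputed features. This is a minimal illustration, not the LESS implementation: it assumes the low-dimensional gradient features already exist as NumPy arrays, and the function name `cosine_scores` and the choice to average over target examples are illustrative assumptions.

```python
import numpy as np

def cosine_scores(train_feats, target_feats, eps=1e-12):
    """Score each training example by its mean cosine similarity
    to the target examples' gradient features.

    train_feats:  (n_train, k) array, one feature row per training example
    target_feats: (n_target, k) array, one feature row per target example
    Averaging over targets is an assumption made for this sketch.
    """
    train = train_feats / (np.linalg.norm(train_feats, axis=1, keepdims=True) + eps)
    target = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + eps)
    sims = train @ target.T          # (n_train, n_target) cosine similarities
    return sims.mean(axis=1)         # one score per training example
```

Higher scores mark training examples whose gradient features point in the same direction as the target examples; negative scores mark examples that point the opposite way.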

The headline result: training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. This is not just an efficiency gain; it is a net improvement. The proposed mechanism: mixed instruction tuning datasets contain examples that actively hinder specific capabilities. As the related note "Does training data format shape reasoning strategy more than domain?" argues, wrong-format examples can shift the model's reasoning strategy away from what the target task requires.
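Keeping only the top 5% can be sketched as a top-fraction cut over per-example scores. This assumes each training example already has a scalar influence score. The helper name `select_top_fraction` is hypothetical, and the rounding rule is an illustrative choice.

```python
import numpy as np

def select_top_fraction(scores, fraction=0.05):
    """Return indices of the highest-scoring examples, best first.

    scores:   1-D sequence of per-example influence scores (assumed given)
    fraction: share of the dataset to keep, e.g. 0.05 for 5%
    """
    scores = np.asarray(scores, dtype=float)
    if scores.size == 0:
        raise ValueError("scores must be non-empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(round(scores.size * fraction)))   # keep at least one example
    top = np.argpartition(scores, -k)[-k:]           # unordered top-k indices
    return top[np.argsort(scores[top])[::-1]]        # order by descending score
```

Examples with low or negative scores are the ones dropped, which is where the examples that hinder the target capability would be expected to fall.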

Three technical innovations make this practical for LLMs: (1) adaptation to the Adam optimizer (influence formulations traditionally assume SGD), (2) variable-length sequence handling (instruction data varies wildly in length, which derails standard gradient comparisons), and (3) low-rank gradient features that compress the storage and computation to feasible levels.
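The three pieces can be sketched together in a few lines of NumPy. This is a hedged illustration under stated assumptions, not the authors' code. The Adam direction uses the standard Adam update formula, with moment estimates passed in as arguments because this note does not say how LESS obtains them. The Gaussian random projection stands in for the low-rank compression, and unit normalization is assumed to be how the length effect is neutralized. All function names are hypothetical.

```python
import numpy as np

def adam_update_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update direction for one gradient, given the running
    first (m) and second (v) moment estimates before this step."""
    m_new = beta1 * m + (1 - beta1) * grad
    v_new = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m_new / (1 - beta1 ** step)      # bias-corrected first moment
    v_hat = v_new / (1 - beta2 ** step)      # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps)

def random_projection(dim_in, dim_out, seed=0):
    """Gaussian projection matrix mapping dim_in features to dim_out
    (an assumed stand-in for the note's low-rank gradient features)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_out)

def unit_normalize(x, eps=1e-12):
    """Scale features to unit length so that comparisons depend on direction,
    not magnitude (assumed here to counter sequence-length effects)."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

# Toy usage: a 6-dimensional "gradient" compressed to 3 stored features.
grad = np.array([0.5, -1.0, 2.0, 0.0, -0.25, 1.5])
direction = adam_update_direction(grad, np.zeros(6), np.zeros(6), step=1)
feature = unit_normalize(direction @ random_projection(6, 3))
```

Storing only the projected, normalized `feature` per example is what keeps the datastore small enough to search.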

The transferability finding is striking: smaller models can select useful data for larger models, and models from different families can share data selections. This suggests the gradient-based quality signal captures something about the data's intrinsic fit with a capability — not just its fit with a particular model's current state. The qualitative analysis confirms this: LESS selects data that "goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills."

This connects to the broader pattern that data quality dominates data quantity. The note "Can models improve themselves on tasks without verifiable answers?" showed that 1000 well-chosen examples can catalyze general self-improvement. The note "Does teacher-refined data always improve student model performance?" showed that data needs to match the student. LESS provides the principled mechanism for finding that match.

Inquiring lines that read this note 17

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can self-supervised signals enable process supervision without human annotation?

Can instruction tuning succeed without explicit task understanding?

What makes weaker teacher models effective for stronger student training?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

How does example difficulty affect learning efficiency in language models?

What are the consequences of models training on synthetic data?

When does architectural design matter more than raw model capacity?

How much do structural inductive biases matter compared to training data volume?

How do adversarial and manipulative prompts attack reasoning models?

Can membership inference attacks reliably detect training data exposure?

Should GUI agents use structured representations instead of raw pixels?

What makes high-quality GUI instruction data different from general vision data?

How do multi-agent systems achieve genuine cooperation and reasoning?

Can influence estimation identify the most valuable trajectories in agentic training?

Why does finetuning cause catastrophic forgetting of model capabilities?

How do newly learned facts become accessible after gradient updates?

What structural advantages do diffusion language models offer over autoregressive methods?

Can gradient-based control reach properties that autoregressive methods cannot?

Related concepts in this collection 7

Can models improve themselves on tasks without verifiable answers?
Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
complementary: LESS finds the right 5%, catalyst data shows 1000 examples suffice

Does teacher-refined data always improve student model performance?
Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.
LESS provides the mechanism for student-aware selection

Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
explains why wrong data hurts: format mismatch shifts reasoning strategy

Does self-generated training data improve model learning?
Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.
related: data-learner compatibility as the key variable

What makes test-time training actually work in practice?
Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
LESS provides the principled mechanism for TTT's first required component (task-similar finetuning): gradient-based influence estimation can identify the most relevant subset for that stage, making it more efficient and less fragile than heuristic data selection

Can careful selection of 78 demos outperform massive training datasets?
Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
LIMI's 78-trajectory result is the agentic analog of LESS's finding: strategic curation outperforms volume. LESS provides the mechanism (gradient-based selection) that could identify which agentic trajectories matter most

Can careful curation replace massive alignment datasets?
Does fine-tuning a strong pretrained model on 1000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? This challenges assumptions about data volume in post-training.
LIMA demonstrates the target state (1000 curated examples suffice for alignment). LESS provides the mechanism for reaching that state: gradient-based selection operationalizes what "careful curation" means computationally
