Can we train better models on less data?
Can gradient-based influence estimation identify which instruction data actually matters most? The research explores whether selecting small subsets of training data by their similarity to target capabilities might outperform training on everything.
LESS (Low-rank gradiEnt Similarity Search) selects instruction tuning data by estimating each example's influence on a target capability. Given a handful of examples embodying a specific skill (e.g., reasoning), LESS constructs a gradient datastore of low-dimensional features and selects training data whose gradient signatures are most similar to the target examples.
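The selection step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes each example has already been reduced to a short gradient feature vector, scores each training example by its mean cosine similarity to the target examples, and keeps the top fraction. The function names, the mean aggregation, and the toy vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_top_fraction(train_features, target_features, fraction=0.05):
    """Return indices of the training examples most similar to the targets.

    Assumption: an example's score is its mean cosine similarity to the
    target examples; other aggregations are possible.
    """
    scored = []
    for idx, feat in enumerate(train_features):
        score = sum(cosine(feat, t) for t in target_features) / len(target_features)
        scored.append((score, idx))
    # Keep at least one example, rounding the budget up.
    k = max(1, math.ceil(fraction * len(train_features)))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy gradient features (hypothetical values, 2 dimensions for readability).
train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
target = [[1.0, 0.1]]
print(select_top_fraction(train, target, fraction=0.5))  # [0, 2]
```

In practice the feature vectors would come from model gradients rather than being written by hand, and the 5% budget corresponds to `fraction=0.05`.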
The headline result: training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. This is not just an efficiency gain; it is a net improvement. The proposed mechanism: mixed instruction tuning datasets contain examples that actively hinder specific capabilities. As explored in "Does training data format shape reasoning strategy more than domain?", examples in the wrong format can shift the model's reasoning strategy away from what the target task requires.
Three technical innovations make this practical for LLMs: (1) adaptation to the Adam optimizer (influence formulations traditionally assume SGD), (2) variable-length sequence handling (instruction data varies wildly in length, which derails standard gradient comparisons), and (3) low-rank gradient features that compress the storage and computation to feasible levels.
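The three ingredients above can be illustrated with a small, self-contained sketch. It is not the LESS codebase, and every helper name here is hypothetical. It assumes (1) the standard Adam update rule with bias correction as the per-example "direction", (2) unit-length normalization as one plausible way to keep gradient magnitude, which can vary with sequence length, from dominating the comparison, and (3) a random sign projection as the dimensionality reduction.

```python
import math
import random


def adam_direction(grad, m, v, step, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update direction for one gradient, given the optimizer moments.

    Standard Adam with bias correction; `step` counts from 1.
    """
    direction = []
    for g, m_i, v_i in zip(grad, m, v):
        m_new = beta1 * m_i + (1.0 - beta1) * g
        v_new = beta2 * v_i + (1.0 - beta2) * g * g
        m_hat = m_new / (1.0 - beta1 ** step)
        v_hat = v_new / (1.0 - beta2 ** step)
        direction.append(m_hat / (math.sqrt(v_hat) + eps))
    return direction


def make_projection(in_dim, out_dim, seed=0):
    """Random +/- 1/sqrt(out_dim) matrix (out_dim rows, in_dim columns)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.choice((-scale, scale)) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vec):
    """Compress a long gradient vector into a short feature vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def normalize(vec):
    """Scale to unit length so only the direction is compared."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0.0 else list(vec)


# Toy example: a 6-dimensional "gradient" compressed to 3 features.
projection = make_projection(in_dim=6, out_dim=3, seed=0)
gradient = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature = normalize(project(projection, gradient))
```

After normalization, two gradients that point the same way but differ in scale map to the same feature, so the comparison depends on direction alone. Real gradient vectors are far larger than this toy one, which is why the projected features, not the raw gradients, are what a datastore would hold.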
The transferability finding is striking: smaller models can select useful data for larger models, and models from different families can share data selections. This suggests the gradient-based quality signal captures something about the data's intrinsic fit with a capability, not just its fit with a particular model's current state. The qualitative analysis supports this: LESS selects data that "goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills."
This connects to the broader pattern that data quality dominates data quantity. The note "Can models improve themselves on tasks without verifiable answers?" showed that 1000 well-chosen examples can catalyze general self-improvement. The note "Does teacher-refined data always improve student model performance?" showed that data needs to match the student. LESS provides a principled mechanism for finding that match.
Inquiring lines that use this note as a source 15
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can instruction tuning succeed without explicit task understanding?
- Can gradient-based influence scores beat difficulty metrics for identifying valuable training data?
- Why does mixed instruction data sometimes hurt specific model capabilities?
- Can selecting the right data subset outperform training on everything?
- How does training data distribution determine what models can learn?
- How much do structural inductive biases matter compared to training data volume?
- Can membership inference attacks reliably detect training data exposure?
- Can gradient-based influence estimation make test-time training more efficient?
- How much task-similar finetuning data does test-time training actually need?
- What makes high-quality GUI instruction data different from general vision data?
- Does gradient-based influence estimation identify which alignment examples actually matter most?
- Can influence estimation identify the most valuable trajectories in agentic training?
- Does importance sampling actually recover capabilities lost to hard sample training?
- Can we cheaply estimate which samples are currently most informative?
- How do newly learned facts become accessible after gradient updates?
Related concepts in this collection 7
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
  complementary: LESS finds the right 5%, catalyst data shows 1000 examples suffice
- Does teacher-refined data always improve student model performance?
  Explores whether higher-quality training data from teacher models uniformly benefits student models, or if compatibility with the student's current learning state matters for effective instruction.
  LESS provides the mechanism for student-aware selection
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  explains why wrong data hurts: format mismatch shifts reasoning strategy
- Does self-generated training data improve model learning?
  Can models learn more effectively from training data they generate themselves rather than data created by external sources? This explores whether a learner's own restructuring process produces better learning outcomes.
  related: data-learner compatibility as the key variable
- What makes test-time training actually work in practice?
  Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
  LESS provides the principled mechanism for TTT's first required component (task-similar finetuning): gradient-based influence estimation can identify the most relevant subset for the task-similar finetuning stage, making TTT's first component more efficient and less fragile than heuristic data selection
- Can careful selection of 78 demos outperform massive training datasets?
  Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
  LIMI's 78-trajectory result is the agentic analog of LESS's finding: strategic curation outperforms volume; LESS provides the mechanism (gradient-based selection) that could identify which agentic trajectories matter most
- Can careful curation replace massive alignment datasets?
  Does fine-tuning a strong pretrained model on 1000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? This challenges assumptions about data volume in post-training.
  LIMA demonstrates the target state (1000 curated examples suffice for alignment); LESS provides the mechanism for reaching that state (gradient-based selection operationalizes what "careful curation" means computationally)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LESS: Selecting Influential Data for Targeted Instruction Tuning
- Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- Exploring Format Consistency for Instruction Tuning
- Instruction Tuning for Large Language Models: A Survey
- A Survey on Post-training of Large Language Models
- Beyond neural scaling laws: beating power law scaling via data pruning
- Are Emergent Abilities in Large Language Models just In-Context Learning?
Original note title
gradient-based influence estimation identifies 5 percent of instruction data that outperforms training on the full dataset