INQUIRING LINE

Is there a quick, cheap way to spot which lessons a model will actually learn the most from — right now?

Can we cheaply estimate which samples are currently most informative?

This explores whether there's a low-cost way to figure out which training examples (or which questions to ask) will teach a model the most *right now* — given that informativeness keeps shifting as the model learns.

The question can be read two ways at once: which *training samples* are worth learning from, and which *queries* are worth asking. The corpus suggests both hinge on a single uncomfortable fact: informativeness isn't a property of the sample, it's a relationship between the sample and the model's current state. The sharpest statement of this is that a sample's learning value depends on the interaction between its difficulty and the model's present ability, so the 'productive band' of useful examples drifts during training and any static difficulty score goes stale within a few steps (How does model ability change what samples teach?). That's the bad news for cheap estimation: whatever you measure, you have to keep re-measuring.
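As a rough illustration of why a static score goes stale, the sketch below re-estimates each sample's pass rate from a handful of fresh attempts and keeps only the samples inside a middle band. The band limits (0.2 to 0.8), the `p * (1 - p)` ranking, and all function names are assumptions made for this example; none of them come from the cited notes.

```python
from typing import Dict, List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled attempts that succeeded."""
    return sum(outcomes) / len(outcomes)


def learning_signal(p: float) -> float:
    """Assumed proxy: Bernoulli variance p * (1 - p).

    It is zero when the model always fails or always succeeds
    and largest at p = 0.5.
    """
    return p * (1.0 - p)


def productive_band(
    rollouts: Dict[str, List[bool]],
    low: float = 0.2,   # assumed lower band limit
    high: float = 0.8,  # assumed upper band limit
) -> List[str]:
    """Return ids whose current pass rate is in [low, high], best first."""
    rates = {sid: pass_rate(o) for sid, o in rollouts.items() if o}
    kept = [sid for sid, p in rates.items() if low <= p <= high]
    return sorted(kept, key=lambda sid: learning_signal(rates[sid]), reverse=True)


# Hypothetical outcomes for the same three samples at two checkpoints.
early = {
    "easy": [True, True, True, True],
    "medium": [True, False, True, False],
    "hard": [False, False, False, False],
}
later = {
    "easy": [True, True, True, True],
    "medium": [True, True, True, True],
    "hard": [True, False, False, True],
}
print(productive_band(early))
print(productive_band(later))
```

The same three samples yield a different selection at each checkpoint, which is the re-measurement cost described above: the score is cheap per sample, but it has to be refreshed as the model changes.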

The good news is that several cheap proxies work surprisingly well. Gradient-based influence estimation uses low-rank gradient features to pick the 5% of instruction data most aligned with a target capability, and training on that slice beats training on everything, partly because some of the discarded data was actively dragging reasoning in the wrong direction (Can we train better models on less data?). A related insight is that you don't always need an external estimator at all: a model's own calibrated token-probability uncertainty is a more reliable signal for deciding when to retrieve than elaborate multi-call heuristics, beating them on single-hop tasks and matching them on multi-hop at a fraction of the compute (Can simple uncertainty estimates beat complex adaptive retrieval?). The self-knowledge is already there; you just have to read it cheaply.
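The gradient-similarity idea can be sketched in a few lines. This is a loose illustration and not the LESS implementation: the random projection stands in for whatever low-rank feature construction is actually used, the gradients are plain arrays supplied by the caller, and `project` and `select_top_fraction` are hypothetical names.

```python
import numpy as np


def project(grads: np.ndarray, dim: int, seed: int = 0) -> np.ndarray:
    """Compress per-example gradients with a fixed random projection.

    Assumption: a Gaussian random projection approximately preserves
    cosine similarity while making storage and comparison cheap.
    """
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((grads.shape[1], dim)) / np.sqrt(dim)
    return grads @ proj


def select_top_fraction(
    train_grads: np.ndarray,   # shape (n_train, n_params)
    target_grads: np.ndarray,  # shape (n_target, n_params)
    fraction: float = 0.05,
    dim: int = 64,
    seed: int = 0,
) -> np.ndarray:
    """Return indices of the training examples most aligned with the target."""
    # The same seed gives both sets the same projection matrix.
    train_feats = project(train_grads, dim, seed)
    target_feats = project(target_grads, dim, seed)

    # Normalise rows so the dot product is a cosine similarity.
    train_feats = train_feats / (np.linalg.norm(train_feats, axis=1, keepdims=True) + 1e-12)
    target_feats = target_feats / (np.linalg.norm(target_feats, axis=1, keepdims=True) + 1e-12)

    # Score each training example by its best match among target examples.
    scores = (train_feats @ target_feats.T).max(axis=1)
    k = max(1, int(round(fraction * len(scores))))
    return np.argsort(-scores)[:k]
```

In use, `train_grads` and `target_grads` would come from backward passes on the candidate pool and on a small set of examples of the target capability. The selection step itself is a single matrix product over compressed features, which is where the low cost comes from.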

On the query side, the same logic appears as active selection. Information-gain simulation scores candidate questions by how much their possible answers would shrink uncertainty, picking the genuinely high-value question instead of a generic one (How can models select the most informative question to ask?). PReF pushes this to an extreme: ten adaptively chosen questions are enough to pin down a personalized reward, because each is selected to maximally reduce coefficient uncertainty (Can user preferences be learned from just ten questions?). Both are 'cheap' precisely because they refuse to ask everything and instead spend the budget where uncertainty is highest. The bandit literature names the underlying tradeoff directly: explore uncertain options, exploit proven ones, and concentrate computation only on the *epistemic* uncertainty that decisions actually turn on rather than irreducible noise (Can neural networks explore efficiently at recommendation scale?; Can bandit algorithms beat collaborative filtering for news?).
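To make the information-gain scoring concrete, here is a minimal one-step sketch. It assumes a finite set of hypotheses with prior probabilities and yes/no questions whose answer is fixed for each hypothesis. The hypotheses, questions, and function names are invented for illustration, and the multi-step simulation and reward propagation described in the UoT note are not modelled.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(
    prior: Dict[str, float],
    answers: Dict[str, bool],
) -> float:
    """Expected drop in entropy over hypotheses after hearing the answer.

    `answers[h]` is the answer the question would receive if hypothesis
    h were true (an assumed, deterministic answer model).
    """
    total = sum(prior.values())
    h_before = entropy([p / total for p in prior.values()])
    h_after = 0.0
    for answer in (True, False):
        group = [p for h, p in prior.items() if answers[h] == answer]
        mass = sum(group)
        if mass > 0:
            h_after += (mass / total) * entropy([p / mass for p in group])
    return h_before - h_after


def best_question(
    prior: Dict[str, float],
    questions: Dict[str, Dict[str, bool]],
) -> str:
    """Pick the question with the largest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))


# Hypothetical diagnostic example with a uniform prior.
prior = {"flu": 0.25, "covid": 0.25, "cold": 0.25, "allergy": 0.25}
questions = {
    "Do you have a fever?": {"flu": True, "covid": True, "cold": False, "allergy": False},
    "Do you have any symptoms?": {"flu": True, "covid": True, "cold": True, "allergy": True},
    "Is it only sneezing?": {"flu": False, "covid": False, "cold": False, "allergy": True},
}
print(best_question(prior, questions))
```

The generic question scores zero because every hypothesis answers it the same way, while the question that splits the hypotheses evenly scores highest. That is the sense in which the budget goes where uncertainty is highest.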

There's a quieter thread worth pulling: sometimes the cheapest informativeness signal is *local and partial*. Step-level confidence catches a reasoning breakdown that whole-trace averaging hides, letting you discard a bad trace before it even finishes generating (Does step-level confidence outperform global averaging for trace filtering?). And a model's own half-formed answer can reveal an information gap the original query never expressed, so the partial response can serve as the next retrieval signal (Can a model's partial response guide what to retrieve next?). Informativeness, in other words, can be estimated mid-stream, not just before you start.
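The contrast between local and global confidence can be shown with a toy trace. The sketch assumes confidence is the mean log-probability of the sampled tokens over a sliding window; the window size, the threshold, and the function names are illustrative choices, not values taken from the cited work.

```python
from typing import List, Optional


def global_confidence(token_logprobs: List[float]) -> float:
    """Mean log-probability over the whole trace."""
    return sum(token_logprobs) / len(token_logprobs)


def min_window_confidence(token_logprobs: List[float], window: int = 4) -> float:
    """Lowest mean log-probability over any sliding window."""
    if len(token_logprobs) < window:
        return global_confidence(token_logprobs)
    means = [
        sum(token_logprobs[i:i + window]) / window
        for i in range(len(token_logprobs) - window + 1)
    ]
    return min(means)


def first_breakdown(
    token_logprobs: List[float],
    window: int = 4,          # assumed window size
    threshold: float = -2.0,  # assumed confidence floor
) -> Optional[int]:
    """Index of the token at which the trace could be stopped early.

    Returns the last token index of the first window whose mean
    log-probability is below the threshold, or None if none is.
    """
    for end in range(window, len(token_logprobs) + 1):
        if sum(token_logprobs[end - window:end]) / window < threshold:
            return end - 1
    return None


# Hypothetical trace: confident, then a short low-confidence stretch.
trace = [-0.1] * 20 + [-3.0] * 4 + [-0.1] * 20
print(global_confidence(trace))
print(min_window_confidence(trace))
print(first_breakdown(trace))
```

The whole-trace average stays high because the long confident stretches dilute the four weak tokens, while the windowed minimum exposes them. Because `first_breakdown` only looks backward, it can run during generation and stop the trace before the remaining tokens are produced.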

The thing you might not have expected: across these papers, 'cheap' and 'better' stop being a tradeoff. Curating 78 demonstrations beats ten thousand (Can careful selection of 78 demos outperform massive training datasets?); 5% of data beats 100%; ten questions beat a survey. The corpus keeps finding that aggressive, uncertainty-guided selection isn't a budget compromise. It outperforms abundance, because most samples are noise or actively harmful, and the cost of estimating informativeness is far smaller than the cost of learning from the wrong things.

Sources (10 notes)

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Can we train better models on less data?

LESS uses low-rank gradient features to select instruction data most similar to target capabilities, and training on the selected 5% consistently outperforms full dataset training. The improvement occurs because mixed datasets contain examples that actively hinder specific skills by shifting reasoning strategy away from task requirements.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

How can models select the most informative question to ask?

UoT combines uncertainty-aware scenario simulation with information-gain scoring and reward propagation to identify questions whose possible answers maximally reduce diagnostic uncertainty—providing a principled mechanism for specific, high-value clarification rather than generic prompts.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Responses can then be personalized to each user through inference-time reward alignment, without modifying model weights.

Can neural networks explore efficiently at recommendation scale?

ENR separates aleatoric from epistemic uncertainty, focusing computation only on parameter uncertainty needed for Thompson sampling. It improved click-through rates 9% and ratings 6% while requiring 29% fewer interactions than baselines.

Can bandit algorithms beat collaborative filtering for news?

LinUCB frames news recommendation as a contextual bandit problem, explicitly balancing exploration of uncertain articles against exploitation of proven ones. The approach handles dynamic content and cold-start users better than traditional CF, with proven regret bounds and lower computational overhead.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Can a model's partial response guide what to retrieve next?

ITER-RETGEN shows that iteratively using generated responses as retrieval queries substantially improves performance on multi-hop reasoning and fact verification. Generation acts as both answer producer and information-need clarifier, surfacing implicit gaps that the original query missed.

Can careful selection of 78 demos outperform massive training datasets?

LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Scalable Neural Contextual Bandit for Recommender Systems (1.71 match)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs (1.65 match)
A Contextual-Bandit Approach to Personalized News Article Recommendation (1.62 match)
Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home (0.91 match)
Language Model Personalization via Reward Factorization (0.89 match)
Deep Think with Confidence (0.88 match)
Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models (0.88 match)
Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation (0.87 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about cheap informativeness estimation in LLMs. The question remains open: *can we estimate which samples or queries are most informative to a model's current state, without expensive recomputation?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2010–2025; most actionable work clusters 2024–2025.
• Informativeness is *dynamic*: a sample's value depends on interaction between its difficulty and the model's present ability; static difficulty scores go stale within training steps (~2024).
• Gradient-based influence estimation identifies 5% of instruction data that beats training on full datasets; discarded data can actively harm reasoning (~2024).
• Token-probability uncertainty from the model itself outperforms elaborate multi-call heuristics for deciding when to retrieve on single-hop tasks and matches them on multi-hop, at lower compute (~2025).
• Information-gain simulation and active selection via uncertainty reduction work: ten adaptively chosen questions pin down personalized reward better than surveys (~2025).
• Step-level confidence catches reasoning breakdowns that whole-trace averaging misses; partial model responses reveal information gaps the original query didn't express (~2025).
• Aggressive curation (78 demonstrations, 5% of data, ten questions) outperforms abundance; most samples are noise or actively harmful (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.04333 (LESS, Feb 2024) — gradient-based data selection for instruction tuning
• arXiv:2402.03271 (Uncertainty of Thoughts, Feb 2024) — uncertainty-aware planning for information seeking
• arXiv:2501.12835 (Adaptive Retrieval Without Self-Knowledge, Jan 2025) — token uncertainty vs. heuristic retrieval
• arXiv:2509.17567 (LIMI, Sep 2025) — agency from selective curation

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, ask: have newer optimizers (e.g., reinforcement learning-based sample weighting), training paradigms (curriculum learning, online adaptation), or evals (live model-in-the-loop benchmarks) since Sept 2025 *relaxed* the need for re-measurement, or proven static proxies that stay calibrated across model checkpoints? Separate the durable core — "informativeness depends on model state" — from claims that newer infrastructure (e.g., continual learning with cheap uncertainty refresh, batched gradient caching) may have solved.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months: papers claiming static difficulty or information-theoretical invariants *do* hold across training, or showing that cheap estimation *fails* on real scales.
(3) Propose 2 research questions that *assume the regime has moved*: e.g., "If uncertainty-guided selection is now reliable, can we use it to automate curriculum design end-to-end?" or "Do the same uncertainty proxies work across model families, or do they require per-family calibration?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.