TOPIC

LLM Evaluations and Benchmarks

20 synthesis notes · 106 source papers
Can smaller models in panels outperform a single large judge?

Does replacing one large language model judge with a diverse panel of smaller models improve evaluation quality while reducing cost and bias? This matters because LLM-based evaluation is widespread but suffers from expense and family-specific bias.

How should we evaluate agent behavior beyond final answers?

Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?

Can frontier exams really measure cutting-edge AI capability?

Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?

Do transformers actually learn systematic compositional reasoning?

Explores whether transformers solve compositional tasks through genuine systematic reasoning or by pattern-matching against training data. This matters because it determines whether scaling alone can achieve robust generalization.

Does setting temperature to zero actually make LLM outputs reliable?

Explores whether deterministic LLM settings that produce consistent outputs also guarantee reliable judgments, and how to measure true reliability beyond surface consistency.

Can dictionary learning scale to production language models?

Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.

Does preference tuning actually reduce the diversity of model outputs?

The field assumes RLHF and DPO reduce diversity, but this assumption rests on measuring all outputs equally. What happens if we only count diverse outputs that meet quality thresholds?

Can live benchmarks prevent contamination in prediction tasks?

Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.

Can fairness frameworks extend to general-purpose language models?

Existing fairness frameworks were designed for narrow, structured tasks. This explores whether they scale to LLMs, which serve multiple populations, sensitive attributes, and use cases simultaneously.

Should interactive evaluation be designed as a unified paradigm?

As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?

Do LLMs overgeneralize when summarizing scientific research?

When LLMs summarize science papers, do they drop important qualifiers and scope limits? This matters because such summaries might mislead readers about what findings actually show.

Can natural language explanations redefine what interpretability means?

Does the ability of LLMs to explain patterns in natural language fundamentally expand the scope and complexity of what humans can understand about AI systems, compared to traditional interpretability methods?

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?

Where does mode collapse in language models really come from?

Researchers investigate whether mode collapse—when models narrow to repetitive outputs—stems from training algorithms or the preference data itself. Understanding the root cause is crucial for fixing diversity loss in creative and synthetic tasks.

Do automated benchmarks hide what frontier AI systems can really do?

Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing—or wrongly inflating—by relying on benchmark scores alone?

Does preference tuning always reduce diversity the same way?

Explores whether the standard narrative that RLHF reduces model diversity holds equally across different task domains, or if the effect varies by what the domain rewards.

Do popular prompting techniques actually improve model performance?

Five widely cited prompting methods (chain-of-thought, emotion prompting, sandbagging, and others) are tested across multiple models and benchmarks to see if their reported improvements hold up under rigorous statistical analysis.

Is hallucination detection progress real or just metric artifacts?

Standard evaluation metrics for hallucination detection may systematically overstate how well methods actually work. The question asks whether reported improvements reflect genuine capability or measurement error.

Why aren't bigger models better for generating diverse outputs?

When generating many unique outputs within a fixed budget, does model size actually matter? Exploring whether the conventional wisdom of using larger models holds for diversity-focused tasks.

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? A positive answer would challenge the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Source papers 106

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.