LLM Evaluations and Benchmarks

Can smaller models in panels outperform a single large judge?

Does replacing one large language model judge with a diverse panel of smaller models improve evaluation quality while reducing cost and bias? This matters because LLM-based evaluation is widespread but expensive, and a single judge can favor outputs from its own model family.

How should we evaluate agent behavior beyond final answers?

Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?

Can frontier exams really measure cutting-edge AI capability?

Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?

Do transformers actually learn systematic compositional reasoning?

Do transformers solve compositional tasks through genuine systematic reasoning, or by pattern-matching against training data? This matters because it determines whether scaling alone can achieve robust generalization.

How can separating evaluation components make reward-hacking visible?

Can breaking benchmarks into independent Benchmark, Harness, and Environment components enable trajectory analysis that reveals failure modes like reward-hacking, which scalar scores typically conceal?

How can subtask grading reveal agent progress on long tasks?

On tasks lasting hours, binary pass-fail verdicts hide meaningful partial progress. Can decomposing long tasks into fine-grained subtasks with intermediate rewards expose what agents actually accomplish?

Does setting temperature to zero actually make LLM outputs reliable?

Do deterministic LLM settings that produce consistent outputs also guarantee reliable judgments? And how should true reliability be measured beyond surface consistency?

Can dictionary learning scale to production language models?

Sparse autoencoders recovered interpretable features from toy models, but scaling to real production systems like Claude remains uncertain. This matters because interpretability at scale is foundational for AI safety work.

Does preference tuning actually reduce the diversity of model outputs?

The field assumes RLHF and DPO reduce diversity, but this assumption rests on metrics that count every output equally, regardless of quality. What happens if we only count diverse outputs that meet quality thresholds?

Can live benchmarks prevent contamination in prediction tasks?

Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.

Can fairness frameworks extend to general-purpose language models?

Existing fairness frameworks were designed for narrow, structured tasks. This explores whether they scale to LLMs, which serve multiple populations, sensitive attributes, and use cases simultaneously.

Should interactive evaluation be designed as a unified paradigm?

As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks appear?

Do LLMs overgeneralize when summarizing scientific research?

When LLMs summarize science papers, do they drop important qualifiers and scope limits? This matters because such summaries might mislead readers about what findings actually show.

Can natural language explanations redefine what interpretability means?

Does the ability of LLMs to explain patterns in natural language fundamentally expand the scope and complexity of what humans can understand about AI systems, compared to traditional interpretability methods?

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?

Where does mode collapse in language models really come from?

What predicts success in ultra-long-horizon agent tasks?

Do automated benchmarks hide what frontier AI systems can really do?

Does preference tuning always reduce diversity the same way?

Do popular prompting techniques actually improve model performance?

Is hallucination detection progress real or just metric artifacts?

Why aren't bigger models better for generating diverse outputs?

Why do agent benchmarks not predict real economic value?

Can LLMs predict novel scientific results better than experts?