How do reasoning models actually break under pressure?

Evaluating how reasoning models behave, where they break, and why their strengths in formal reasoning don't transfer to social tasks.

Topic Hub · 42 linked notes · 15 sections

Sub-Maps

2 notes

Can we actually trust reasoning model outputs?

When reasoning models show their work through reflection and traces, do those explanations faithfully represent what's happening? This explores whether self-monitoring mechanisms genuinely correct errors or just create an illusion of reliability.

Where exactly do reasoning models fail and break?

Exploring the specific failure modes in reasoning models—from search inefficiency and mode selection errors to adversarial vulnerabilities and social reasoning gaps. Understanding these breaks is crucial for building more robust AI systems.

Writing Angles

4 notes

Is reflection in reasoning models actually fixing mistakes?

Do the thinking steps that appear after a model's first answer represent genuine self-correction, or are they mostly confirming what the model already concluded? Understanding this matters for how we train and deploy reasoning systems.

Can we monitor AI reasoning without destroying what makes it readable?

Explores the tension between using chain-of-thought traces to catch misbehavior and the risk that optimization pressure will teach models to hide their actual reasoning. Asks whether readable reasoning might be incompatible with safe training.

Why do reasoning models abandon promising solution paths?

Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.

Can LLM judges be tricked without accessing their internals?

Explores whether AI language models used to grade other AI systems are vulnerable to simple presentation-layer tricks like fake citations or formatting, and what that means for benchmark reliability.

Prompt Sensitivity and Robustness

1 note

Does model confidence predict robustness to prompt changes?

Explores whether a model's certainty about its answer determines how much it resists prompt rephrasing and semantic variation. This matters because it could explain why some tasks are harder to evaluate reliably.

Evaluation Methodology Failures

6 notes

Do popular prompting techniques actually improve model performance?

Five widely cited prompting methods (chain-of-thought, emotion prompting, sandbagging, and others) are tested across multiple models and benchmarks to see whether their reported improvements hold up under rigorous statistical analysis.

Does setting temperature to zero actually make LLM outputs reliable?

Explores whether deterministic LLM settings that produce consistent outputs also guarantee reliable judgments, and how to measure true reliability beyond surface consistency.

Is hallucination detection progress real or just metric artifacts?

Standard evaluation metrics for hallucination detection may systematically overstate how well methods actually work. The question asks whether reported improvements reflect genuine capability or measurement error.

Do short benchmarks predict how models perform over long workflows?

Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.

Are LLM and agent benchmarks really measuring different things?

Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.

Does supervised fine-tuning actually improve reasoning on optimization problems?

When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?

Diversity Evaluation Reframed

3 notes

Does preference tuning actually reduce the diversity of model outputs?

The field assumes RLHF and DPO reduce diversity, but this assumption rests on measuring all outputs equally. What happens if we only count diverse outputs that meet quality thresholds?

Does preference tuning always reduce diversity the same way?

Explores whether the standard narrative that RLHF reduces model diversity holds equally across different task domains, or if the effect varies by what the domain rewards.

Why aren't bigger models better for generating diverse outputs?

When generating many unique outputs within a fixed budget, does model size actually matter? Exploring whether the conventional wisdom of using larger models holds for diversity-focused tasks.

Cross-Paper Synthesis (2026-05-18)

1 note

Does a single benchmark score actually predict agent readiness?

Single-axis benchmarks rank models by one capability—like task success—but ignore privacy, duration, operating mode, and ecosystem fit. Can one number really capture what matters for deployment?

Multi-Dimensional Capability Evaluation

1 note

Do phone agents succeed at all three critical tasks equally?

Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.

Adaptive Sparsity as Difficulty Signal

3 notes

Do language models sparsify their activations under difficult tasks?

When LLMs encounter unfamiliar or difficult inputs, do their internal representations become sparser rather than denser? Understanding this adaptive response could reveal how models stabilize reasoning under uncertainty.

Can representation sparsity order few-shot demonstrations effectively?

Does measuring how sparse a model's hidden states are for each example provide a reliable signal for ordering few-shot demonstrations in prompts? This matters because curriculum ordering significantly affects in-context learning performance.

How much sparsity can different reasoning tasks actually tolerate?

Different NLP tasks show vastly different tolerance for sparse attention—from 95% on simple QA to 50-67% on multi-hop reasoning. What structural differences explain this variation, and how should it shape deployment decisions?

Constraint Satisfaction and Optimization Limits

3 notes

Do larger language models solve constrained optimization better?

Explores whether scaling LLMs—through more parameters, better training, or reasoning extensions—improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.

Do reasoning models actually beat standard models on optimization?

Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.

Do fine-tuned language models actually learn optimization procedures?

Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.

Core Ideas

3 notes

How should we evaluate agent behavior beyond final answers?

Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?

Should interactive evaluation be designed as a unified paradigm?

As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?

Benchmark Limits — Batch #3 backlog *(2026-06-03)*

2 notes

Do automated benchmarks hide what frontier AI systems can really do?

Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing—or wrongly inflating—by relying on benchmark scores alone?

Can frontier exams really measure cutting-edge AI capability?

Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?

Diversity, Live Benchmarks, Modest-Baseline Domains — Batch #3 wave 2 *(2026-06-03)*

3 notes

Where does mode collapse in language models really come from?

Researchers investigate whether mode collapse—when models narrow to repetitive outputs—stems from training algorithms or the preference data itself. Understanding the root cause is crucial for fixing diversity loss in creative and synthetic tasks.

Can live benchmarks prevent contamination in prediction tasks?

Real-time benchmarks that continuously gather questions and verify outcomes could solve the data contamination problem in forecasting evaluation. This matters because leaked training data makes it impossible to know if models truly predict or merely retrieve memorized answers.

Can language models beat human venture capital experts?

Explores whether LLMs can outperform top investors at founder success prediction in a domain where even experts show only modest accuracy. Tests whether AI forecasting is competitive in sparse-signal, high-uncertainty settings.

Human-preference and forecasting evaluation — Batch #4 backlog *(2026-06-03)*

2 notes

Can crowdsourced votes reliably rank language models?

Explores whether large-scale human preference voting from casual users produces valid model rankings comparable to expert judgment, and what makes such crowdsourced evaluation trustworthy at scale.

Can retrieval-augmented language models forecast like human experts?

Can language models augmented with search and reasoning match or exceed the forecasting accuracy of competitive human crowd forecasters on events beyond their training data? This tests whether AI can handle genuine predictive judgment.

Jury evaluation and temporal understanding — Batch #5 backlog *(2026-06-03)*

2 notes

Can smaller models in panels outperform a single large judge?

Does replacing one large language model judge with a diverse panel of smaller models improve evaluation quality while reducing cost and bias? This matters because LLM-based evaluation is widespread but suffers from expense and family-specific bias.

Can video language models actually understand time?

This research investigates whether video LLMs truly grasp temporal concepts like causality and event progression, or merely recognize spatial content across frames. Understanding this gap matters for video understanding tasks that depend on reasoning about time.