Test-Time Compute

When can weak models match strong model performance?

Can sampling many weak model calls replicate strong model results? This explores whether more attempts and selection mechanisms can bridge the performance gap without fundamentally stronger reasoning.

Does raw token spending actually predict agent performance?

Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.

Does scale alone teach models to reason without hand-crafted rewards?

At what model size do large language models spontaneously discover self-verification and structured reasoning without auxiliary reward signals? This asks whether trillion-parameter scale removes the need for human-designed scaffolding.

Can we allocate inference compute based on prompt difficulty?

Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
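
One way to picture difficulty-based allocation is the sketch below. It assumes a difficulty estimate per prompt is already available (how to obtain one is not specified here), and the function name `allocate_samples` and the proportional rule are illustrative choices, not a method taken from any particular paper.

```python
def allocate_samples(difficulties, total_samples, min_per_prompt=1):
    """Split a fixed sample budget across prompts in proportion to difficulty.

    difficulties: non-negative numbers, one per prompt (hypothetical estimates).
    Every prompt gets at least `min_per_prompt` samples; the rest is shared
    proportionally, with rounding leftovers given to the largest remainders.
    """
    n = len(difficulties)
    if total_samples < n * min_per_prompt:
        raise ValueError("budget too small for the per-prompt minimum")

    spare = total_samples - n * min_per_prompt
    total_difficulty = sum(difficulties)
    if total_difficulty == 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * d / total_difficulty for d in difficulties]

    alloc = [min_per_prompt + int(s) for s in shares]
    leftover = total_samples - sum(alloc)

    # Hand out samples lost to rounding, largest fractional part first.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# A fixed budget of 13 samples over an easy, a medium, and a hard prompt.
print(allocate_samples([1, 3, 6], total_samples=13))
```

A fixed-budget baseline would give each prompt the same count; the comparison between the two is what the question above is about.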

Does step-level confidence outperform global averaging for trace filtering?

Explores whether measuring confidence at individual reasoning steps, rather than averaging across entire traces, better identifies and filters out low-quality reasoning. This matters because it could improve both accuracy and compute efficiency in multi-trace reasoning.
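
The difference between the two filters can be shown with a small sketch. It assumes each trace comes with one confidence value per reasoning step, and it uses the minimum step confidence as the step-level statistic; both the data layout and that choice of statistic are assumptions made for illustration.

```python
def global_confidence(step_confidences):
    """Average confidence over the whole trace."""
    return sum(step_confidences) / len(step_confidences)


def lowest_step_confidence(step_confidences):
    """Step-level view: a trace is only as confident as its weakest step."""
    return min(step_confidences)


def filter_traces(traces, score_fn, threshold):
    """Keep the names of traces whose score meets the threshold."""
    return [name for name, steps in traces.items() if score_fn(steps) >= threshold]


# Hypothetical per-step confidences for two traces.
traces = {
    "trace_a": [0.9, 0.9, 0.2, 0.9, 0.9],       # one very shaky step
    "trace_b": [0.75, 0.75, 0.75, 0.75, 0.75],  # uniformly moderate
}

print(filter_traces(traces, global_confidence, 0.7))
print(filter_traces(traces, lowest_step_confidence, 0.7))
```

With these made-up numbers the global average keeps both traces, because the single weak step in `trace_a` is averaged away, while the step-level score drops it.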

Why do correct reasoning traces contain fewer tokens?

In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.

Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

Can verifiers monitor reasoning without slowing generation down?

Explores whether asynchronous verification can catch reasoning errors while keeping token costs near parity with unmonitored reasoning. This matters because current approaches trade off early error detection against computational overhead.

When does explicit reasoning actually help model performance?

Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?

Does extended thinking actually improve reasoning or just increase variance?

When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.

Can we automatically generate formal verifiers from policy text?

Verifier scarcity blocks process verification in most domains. Can language models synthesize correct-by-construction formal checkers directly from natural-language policies, bridging informal rules and rigorous proof?

Do hedging markers actually signal careful thinking in AI?

Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.

How do internal and external test-time scaling compare?

Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.

Do iterative refinement methods suffer from overthinking?

Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?

Why does majority voting outperform more complex inference methods?

Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
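
For reference, the two selection rules being compared can be sketched in a few lines. The sketch assumes final answers have already been extracted as strings and that Best-of-N has access to some per-sample score (for example from a reward model); the scores below are made up.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, scores):
    """Return the answer of the single highest-scoring sample."""
    best_answer, _ = max(zip(answers, scores), key=lambda pair: pair[1])
    return best_answer


# Hypothetical final answers from four independent samples.
answers = ["42", "41", "42", "7"]
scores = [0.2, 0.9, 0.3, 0.1]  # hypothetical scorer output

print(majority_vote(answers))
print(best_of_n(answers, scores))
```

Majority voting needs no scorer at all, whereas Best-of-N follows whatever the scorer prefers, so in this toy case a single overrated sample decides its output.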

Can non-reasoning models catch up with more compute?

Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.

Does thinking time need to scale with problem difficulty?

Can a single fixed compute budget work equally well across problems of varying difficulty, or does optimal thinking length change with how hard a problem is?

Why does more reasoning sometimes make models worse?

Explores whether extended thinking tokens genuinely improve model accuracy or can actively harm it by causing models to abandon correct answers mid-reasoning.

Why does parallel reasoning outperform single chain thinking?

Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.

How should we balance parallel versus sequential compute at test time?

Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Does more thinking time always improve reasoning accuracy?

Explores whether adding thinking tokens keeps improving performance, or whether there is a point beyond which additional reasoning becomes counterproductive.

Where do reasoning agents actually fail during long traces?

Does verifying only final answers miss the real sources of failure in multi-step reasoning? This explores whether intermediate process checks reveal errors that outcome-level scoring hides.

Does revising your own reasoning actually help or hurt?

Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.

Does self-revision actually improve reasoning in language models?

When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.

Can self-supervised process rewards replace human annotation?

Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.

Can diverse mediocre traces outperform redundant expert traces?

Standard RL rewards individual traces for correctness, but what if a diverse set of weaker traces collectively helps an aggregator better than homogeneous strong ones? This explores whether group-level objectives differ fundamentally from item-level quality.

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.

Can reading logit distributions break ties in LLM judging?

Standard LLM judges output discrete scores that create frequent ties between different solutions. Could computing expectations over scoring-token logits instead yield continuous scores that meaningfully discriminate between complex outputs?
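
A minimal sketch of the expectation-based score follows. It assumes the judge exposes one logit per candidate score token (here the integers 1 to 5); the logit values are invented for illustration and the function name `expected_score` is hypothetical.

```python
import math


def expected_score(score_logits):
    """Expected score under a softmax over the scoring-token logits.

    score_logits: dict mapping each integer score to its logit.
    """
    top = max(score_logits.values())  # subtract the max for numerical stability
    weights = {s: math.exp(logit - top) for s, logit in score_logits.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total


def discrete_score(score_logits):
    """What a standard judge reports: the single most likely score."""
    return max(score_logits, key=score_logits.get)


# Two hypothetical solutions that both receive a discrete score of 4.
solution_a = {1: -4.0, 2: -2.0, 3: 0.0, 4: 2.0, 5: 1.5}
solution_b = {1: -4.0, 2: -1.0, 3: 1.5, 4: 2.0, 5: -1.0}

print(discrete_score(solution_a), discrete_score(solution_b))
print(expected_score(solution_a), expected_score(solution_b))
```

Both solutions tie on the discrete score, but the expectation separates them because solution A places more probability on 5 and solution B more on 3.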

Can routing mask future experts to prevent knowledge leakage?

Can models be built so that they respect query timestamps by selectively silencing experts trained on future data? This explores whether temporal causality can be enforced through architecture rather than external retrieval.
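
The masking idea can be sketched as a softmax over router logits in which experts trained on data later than the query timestamp are excluded. Representing each expert by a single data-cutoff year, and the function name `masked_routing_weights`, are simplifying assumptions for illustration rather than a description of any specific architecture.

```python
import math


def masked_routing_weights(router_logits, expert_cutoff_years, query_year):
    """Softmax over experts, silencing those trained on data after query_year."""
    masked = [
        logit if cutoff <= query_year else -math.inf
        for logit, cutoff in zip(router_logits, expert_cutoff_years)
    ]
    if all(m == -math.inf for m in masked):
        raise ValueError("no expert predates the query timestamp")

    top = max(masked)
    exps = [math.exp(m - top) for m in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical experts with data cutoffs 2019, 2021, and 2023.
weights = masked_routing_weights(
    router_logits=[1.0, 1.0, 5.0],
    expert_cutoff_years=[2019, 2021, 2023],
    query_year=2021,
)
print(weights)
```

The 2023 expert has the highest raw logit but receives zero weight for a 2021 query, so the remaining two experts share the routing mass.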

Can inference compute replace scaling up model size?

Explores whether smaller models given more thinking time during inference can match larger models. This matters because it reshapes deployment economics and compute allocation strategies.

Why does post-training ignore parallel and aggregative reasoning?

Post-training optimizes only sequential reasoning within single traces, yet inference uses parallel sampling and cross-trace aggregation. Does this train-inference gap explain why more test-time compute sometimes fails to help?

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.
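
The pseudo-labeling step can be sketched as follows. The sketch assumes final answers have already been extracted from each sampled trace and uses a simple 0/1 agreement reward; the function name `consensus_rewards` is hypothetical, and the policy update that would consume these rewards is omitted.

```python
from collections import Counter


def consensus_rewards(answers):
    """Treat the majority answer as a pseudo-label and reward agreement with it.

    Returns the pseudo-label and one reward per sampled answer
    (1.0 if it matches the majority answer, otherwise 0.0).
    """
    pseudo_label = Counter(answers).most_common(1)[0][0]
    rewards = [1.0 if answer == pseudo_label else 0.0 for answer in answers]
    return pseudo_label, rewards


# Hypothetical answers sampled for one unlabeled question.
sampled = ["12", "15", "12", "12", "9"]
label, rewards = consensus_rewards(sampled)
print(label, rewards)
```

No ground-truth label appears anywhere in this computation, so the reward is only as good as the majority: if the most common answer is wrong, agreement with it is still rewarded.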

When does majority-vote reward actually help test-time learning?

Test-time RL using consensus rewards shows contradictory results across different models and domains. What determines whether consensus amplifies correct answers or reinforces confident mistakes?

What makes test-time training actually work in practice?

Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.

Does more thinking time actually improve LLM reasoning?

The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly?

Why do reasoning models fail differently at training versus inference?

Reasoning models exhibit two distinct failure modes—entropy collapse during training and variance inflation during inference—that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.

Can verification accuracy scale without retraining the model?

Does correctness-checking improve as its own inference-time scaling axis, separate from pre-training and post-training compute? This matters because it could unlock better feedback for reasoning without expensive model retraining.

How can we predict the optimal thinking token threshold?

Researchers are exploring what determines when a model should stop reasoning on a given task, since accuracy degrades beyond a critical threshold but no principled prediction method exists yet.