How should we allocate compute budget at inference time?
Test-time scaling studies how to spend computational resources at query time rather than during training. The core challenge: given a fixed inference budget, what is the optimal way to allocate it across different problems?
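To make the allocation question concrete, here is a minimal sketch of one possible framing, not a method taken from the text. It assumes the budget is counted in samples, that each problem has a known per-sample success probability, that samples are independent, and that a perfect verifier picks out any correct sample, so a problem with probability p and n samples is solved with probability 1 - (1 - p)^n. The function names `allocate_samples` and `expected_solved` are hypothetical.

```python
import heapq


def allocate_samples(success_probs, budget):
    """Greedily split `budget` samples across problems.

    success_probs[i] is an ASSUMED per-sample success probability for
    problem i. In practice it would have to be estimated.
    Returns a list of sample counts, one per problem.
    """
    counts = [0] * len(success_probs)
    # Max-heap (via negated values) of the marginal gain of the next sample.
    # The first sample on problem i raises its solve probability by p_i.
    heap = [(-p, i) for i, p in enumerate(success_probs)]
    heapq.heapify(heap)

    for _ in range(budget):
        if not heap:
            break
        neg_gain, i = heapq.heappop(heap)
        if neg_gain == 0:
            # No remaining sample can improve anything; stop spending.
            break
        counts[i] += 1
        p = success_probs[i]
        # Gain of the next sample: the chance that all earlier samples
        # failed, times the chance that the new one succeeds.
        next_gain = p * (1 - p) ** counts[i]
        heapq.heappush(heap, (-next_gain, i))
    return counts


def expected_solved(success_probs, counts):
    """Expected number of problems solved under the assumptions above."""
    return sum(1 - (1 - p) ** n for p, n in zip(success_probs, counts))
```

Under these assumptions each extra sample on a problem helps less than the previous one, so always giving the next sample to the problem with the largest marginal gain maximizes the expected number of problems solved. For example, `allocate_samples([0.9, 0.5, 0.1], 5)` returns `[1, 3, 1]`: the easy problem needs only one sample, and most of the budget goes to the problem of intermediate difficulty.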