Should evaluations shift toward open-world messy tasks instead of contests?
This explores whether AI evaluation should move away from clean, single-score contests (leaderboards, one-shot benchmarks) and toward the kind of long, ambiguous, multi-step work real systems actually do — and what the corpus says is gained or lost in that shift.
This question reads as: are tidy benchmark contests measuring the wrong thing, and should we grade AI on messy, open-ended work instead? The corpus doesn't take a side so much as show *why* the contest format quietly distorts what we think we're measuring — and what a messier evaluation would have to capture instead.
The sharpest warning against contest-style scoring is that it is easy to game with style rather than substance. Models trained to imitate ChatGPT fool human evaluators with confident, fluent prose while closing no actual capability gap: the score goes up, the underlying ability doesn't (Can imitating ChatGPT fool evaluators into thinking models improved?). That's the core indictment of a contest: a single judged output rewards the *appearance* of quality. The same pathology shows up when you personalize the judge. Reward models tuned per user start rewarding agreement and flattery, amplifying echo chambers, because the evaluation no longer has any anchor outside the user's own preferences (Does personalizing reward models amplify user echo chambers?).
What open-world tasks reveal that contests hide is *persistence over time*. On 36 expert-curated optimization tasks across 17 frontier models, the dominant predictor of success wasn't the quality of the first attempt. It was whether a model kept iterating through benchmark-edit-incorporate cycles instead of quitting early or burning its budget unproductively (What predicts success in ultra-long-horizon agent tasks?). A one-shot contest literally cannot see this dimension; it scores the opening move and stops. Messy long-horizon tasks make stamina, recovery, and not-giving-up into measurable properties.
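A toy sketch of why a one-shot score cannot see persistence. Everything here is hypothetical and invented for illustration (`run_agent`, `halve_gap`, the budget, the starting scores); it is not the benchmark's harness or its data, only a minimal model of a benchmark-edit-incorporate loop under a fixed budget.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunTrace:
    scores: List[float]  # score held after each cycle, starting with the first attempt

    @property
    def first_attempt(self) -> float:
        return self.scores[0]

    @property
    def best(self) -> float:
        return max(self.scores)


def run_agent(propose: Callable[[float], float], start: float,
              budget: int, quit_after: int) -> RunTrace:
    """Run benchmark-edit-incorporate cycles until the budget or the agent gives out.

    `quit_after` is a hypothetical stand-in for how many cycles the agent
    is willing to attempt before it stops on its own.
    """
    current = start
    scores = [current]
    for step in range(1, budget):
        if step >= quit_after:
            break                      # the agent terminates early
        candidate = propose(current)   # edit
        if candidate > current:        # benchmark, then incorporate only if better
            current = candidate
        scores.append(current)
    return RunTrace(scores)


def halve_gap(score: float) -> float:
    """Hypothetical edit step: each cycle closes half the remaining gap to 1.0."""
    return score + 0.5 * (1.0 - score)


# Strong opening move, but stops immediately.
quitter = run_agent(halve_gap, start=0.6, budget=10, quit_after=1)
# Weaker opening move, but keeps iterating for the whole budget.
persistent = run_agent(halve_gap, start=0.4, budget=10, quit_after=10)

print(quitter.first_attempt, quitter.best)
print(persistent.first_attempt, round(persistent.best, 4))
```

A one-shot contest would rank `quitter` first because it only reads `first_attempt`; a long-horizon task ranks `persistent` first because it reads `best`.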
But the corpus also shows the cost of the shift: messy tasks are far harder to grade reliably, and naive automation makes it worse. An eight-module agentic evaluator that actively collects evidence cut 'judge shift' to 0.27% versus 31% for a plain LLM-as-judge on complex tasks, roughly a hundredfold reduction. Yet its own memory module cascaded errors, so the more open-ended evaluation needed error isolation just to stay trustworthy (Can agents evaluate AI outputs more reliably than language models?). The lesson recurs: judges that *reason about the reasoning* (generating critique chains rather than emitting a classifier score) are both more accurate and more data-efficient on hard problems (Can judges that reason about reasoning outperform classifier rewards?). The richer the task, the more the evaluator has to think rather than tally.
So the deeper point isn't 'contests bad, messy tasks good'. It's that a single scalar is the wrong shape for open-world feedback. Real feedback decomposes into two things a number can't hold at once: *evaluative* (how well did it go) and *directive* (what should change) (Can scalar rewards capture all the information in agent feedback?). That's exactly why models stuck on a numerical-reward plateau break through when given natural-language critiques explaining *why* they failed (Can natural language feedback overcome numerical reward plateaus?). A contest gives you the evaluative bit and throws the directive bit away. The thing you didn't know you wanted to know: shifting toward messy tasks isn't mainly about realism. It's that messy tasks force evaluation to recover the directional information clean leaderboards were structurally throwing on the floor.
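The evaluative/directive split can be made concrete with a small data structure. This is a minimal sketch under assumed names (`Feedback`, `to_scalar`, and the example critiques are all hypothetical), not the representation used in any of the cited notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Feedback:
    score: float                                         # evaluative: how well did it go
    directives: List[str] = field(default_factory=list)  # directive: what should change


def to_scalar(fb: Feedback) -> float:
    """Collapse feedback to what a leaderboard keeps: the number alone."""
    return fb.score


# Two hypothetical critiques with the same score but different fixes.
a = Feedback(0.4, ["cite a source for the latency claim"])
b = Feedback(0.4, ["handle the empty-input case before parsing"])

print(to_scalar(a) == to_scalar(b))   # the scalar view cannot tell them apart
print(a.directives == b.directives)   # the directive view can
```

Once both critiques pass through `to_scalar`, they are indistinguishable, which is the sense in which a single number discards the directive half of the feedback.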
Sources (7 notes)
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
Across 17 frontier models on 36 expert-curated optimization tasks, repeated benchmark-edit-incorporate cycles within a wall-clock budget proved the dominant success predictor. Most models terminated early or burned budget unproductively; Claude Opus 4.6 stood out as persistent.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.