Can benchmark scores on verifiable tasks transfer to unseen problems outside the training domain?
This explores whether high scores on checkable tasks (math, code, puzzles with right answers) actually predict performance on problems the model never trained on — or whether they're inflated by memorization and break down off-distribution.
The corpus is unusually unified on the answer: mostly no, and it offers several distinct mechanisms for why. The most direct evidence is contamination. A Qwen math model can reconstruct over half of a popular benchmark from partial prompts, yet it scores 0.0% on a benchmark released after its training cutoff [Does RLVR success on math benchmarks reflect genuine reasoning improvement?]. So a chunk of what looks like 'reasoning skill' is the model recognizing problems it has effectively already seen, and recognition by definition can't transfer to genuinely unseen problems.
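A partial-prompt probe of this kind can be sketched in a few lines. This is a minimal illustration, not the cited study's protocol: the `complete` callable stands in for whatever model API is available, and the 60% prefix fraction and exact-prefix match criterion are assumptions chosen for simplicity.

```python
from typing import Callable, Sequence


def split_prompt(problem: str, prefix_frac: float = 0.6) -> tuple[str, str]:
    """Split a problem statement into a visible prefix and a hidden remainder."""
    words = problem.split()
    cut = max(1, int(len(words) * prefix_frac))
    return " ".join(words[:cut]), " ".join(words[cut:])


def reconstruction_rate(
    problems: Sequence[str],
    complete: Callable[[str], str],  # hypothetical model call: prefix -> continuation
    prefix_frac: float = 0.6,        # assumed fraction of the problem shown to the model
) -> float:
    """Fraction of problems whose hidden remainder the model reproduces verbatim."""
    hits = 0
    scored = 0
    for problem in problems:
        prefix, remainder = split_prompt(problem, prefix_frac)
        if not remainder:  # problem too short to hide anything
            continue
        scored += 1
        # Normalize whitespace so the comparison matches split_prompt's output.
        continuation = " ".join(complete(prefix).split())
        # Crude criterion: the continuation must begin with the hidden text.
        if continuation.startswith(remainder):
            hits += 1
    return hits / scored if scored else 0.0
```

A high rate on an older benchmark paired with a near-zero score on a post-cutoff one is the pattern the note describes. A softer overlap metric could replace the exact match, at the cost of more false positives.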
The deeper problem is that even non-contaminated gains can be distribution-bounded. Chain-of-thought reasoning degrades predictably when the task, length, or format shifts away from training: models keep producing fluent, reasoning-shaped text while the underlying logic quietly stops being valid [Does chain-of-thought reasoning actually generalize beyond training data?]. That's the unsettling part: transfer doesn't fail loudly with an error, it fails silently with confident nonsense. There is also a structural reason a perfect score can say nothing about understanding: two networks can produce identical outputs on every test input while carrying completely different internal representations, a fracturing that standard benchmarks are blind to [Can AI pass every test while understanding nothing?].
Worth noticing: the verifiable-task regime is exactly where these gains are real but narrow. A 3B model can match frontier systems on math and coding precisely because those tasks have checkable ground truth that gives RL a clean reward signal, but the authors bound the claim to verifiable domains and don't claim it spreads further [Can small models match frontier reasoning without massive scale?]. So the very property that makes a task trainable (a crisp verifier) is also what keeps the skill from automatically generalizing. One paper even argues that the reusable unit of reasoning isn't a dataset at all but a whole feedback interface (verifier, base model, optimizer, scaffold, budget), so 'the same training' moved to a new setup can produce a different effect [the-reusable-unit-of-post-training-reasoning-is-not-a-prompt-response-but-a].
There's a subtler trap too: benchmark improvement and genuine capability can be separable phenomena that coexist. RLVR can switch on real reasoning patterns while the headline benchmark number rises mostly from memorized, contaminated data. The two are measured at different levels, and you can get one without the other [Can genuine reasoning activation coexist with contaminated benchmarks?]. This is why how you measure transfer matters as much as whether it happens. Benchmarks that filter out ambiguous examples hide a 32%-vs-90% accuracy gap [Do standard NLP benchmarks hide LLM ambiguity failures?], and automated graders both overstate and understate true capability by privileging neatly auto-gradable tasks [Do automated benchmarks hide what frontier AI systems can really do?].
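The filtering effect is easy to see once accuracy is reported separately for items annotators agreed on and items they did not. The sketch below uses synthetic toy data invented for illustration (it does not reproduce the 32%-vs-90% result), and the `Example` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    correct: bool           # did the model answer this item correctly?
    annotators_agree: bool  # False marks an item a filtered benchmark would drop


def accuracy(examples: Sequence[Example]) -> float:
    """Plain accuracy; NaN for an empty slice so a missing subset is visible."""
    if not examples:
        return float("nan")
    return sum(e.correct for e in examples) / len(examples)


def accuracy_by_agreement(examples: Sequence[Example]) -> tuple[float, float]:
    """Return (accuracy on agreed-upon items, accuracy on ambiguous items)."""
    clear = [e for e in examples if e.annotators_agree]
    ambiguous = [e for e in examples if not e.annotators_agree]
    return accuracy(clear), accuracy(ambiguous)


# Synthetic toy data: 10 clear items (8 correct), 4 ambiguous items (1 correct).
toy = (
    [Example(True, True)] * 8
    + [Example(False, True)] * 2
    + [Example(True, False)] * 1
    + [Example(False, False)] * 3
)
clear_acc, ambiguous_acc = accuracy_by_agreement(toy)
print(f"clear: {clear_acc:.2f}  ambiguous: {ambiguous_acc:.2f}")
```

A benchmark that keeps only the agreed-upon items would report the first number alone, so the second never appears in the headline score.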
If there's a hopeful thread, it's about how to make transfer real rather than whether it's automatic. Live, contamination-free benchmarks verify against outcomes the model couldn't have memorized, such as real future events, and they reveal that hard out-of-domain problems demand search-and-reasoning agents, not just a well-scored base model [Can live benchmarks prevent contamination in prediction tasks?]. And methods that drop the verifier entirely, rewarding the likelihood of reference answers instead, push reasoning into general domains where no clean checker exists [Can reasoning improvement work without answer verification?]. The takeaway: a verifiable benchmark score is best read as a claim about a narrow trainable region, and proving transfer requires evaluation on data the model provably hasn't seen, not a bigger number on the same test.
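The verifier-free idea can be illustrated with the reward computation alone. This is a minimal sketch, assuming the policy model already supplies per-token log-probabilities of the reference answer conditioned on the question and a sampled reasoning trace; the function name is hypothetical, and the policy update that uses this value as a weight is not shown.

```python
import math
from typing import Sequence


def reference_answer_reward(answer_token_logprobs: Sequence[float]) -> float:
    """Probability of the reference answer given the question and reasoning trace.

    Each entry is log p(token | question, trace, earlier answer tokens), so the
    exponentiated sum is the conditional probability of the whole answer.
    """
    if not answer_token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return math.exp(sum(answer_token_logprobs))


# Toy comparison with made-up log-probabilities for two reasoning traces.
helpful_trace = [math.log(0.9), math.log(0.8)]    # trace makes the answer likely
unhelpful_trace = [math.log(0.2), math.log(0.1)]  # trace makes the answer unlikely
print(reference_answer_reward(helpful_trace) > reference_answer_reward(unhelpful_trace))
```

No answer checker appears anywhere: a trace is rewarded to the extent that it makes the known reference answer probable, which is what lets the approach extend to domains without a rule-based verifier.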
Sources 10 notes
Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
A 3B model trained with curriculum SFT and multi-domain RL scores 94.3 on AIME26 and 80.2 on LiveCodeBench, matching much larger systems. The result is bounded to verifiable tasks with checkable ground truth, where RL can provide clean reward signals.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.
Automated benchmarks both overstate and understate capability by privileging precisely-specified, auto-gradable tasks. Open-world evaluations of long-horizon messy tasks through qualitative log analysis—with cost explicitly reported—correct these distortions and catch emerging capabilities earlier.
FutureX, a live benchmark collecting questions from 195 sources and verifying real outcomes, shows that base models handle easy predictions but hard open-ended forecasting demands search-and-reasoning agents. This suggests forecasting is an agentic capability, not a base-model strength.
VeriFree bypasses answer verification entirely by using the conditional probability of reference answers given generated reasoning traces as both reward signal and training weight. This approach matches or surpasses verifier-based methods on MMLU-Pro, GPQA, and SuperGPQA without rule-based or model-based verifiers.