Can a model be strong at MMLU but weak at long-horizon tasks?
This explores whether benchmark strength on knowledge-recall tests like MMLU actually predicts how well a model holds up on long, multi-step tasks — and the corpus suggests these are nearly separate axes of competence.
This reads the question as: does scoring high on a static knowledge test (MMLU) tell you whether a model can sustain a long chain of dependent steps? The corpus says fairly emphatically that it does not: *the two come apart*, and it explains why. The clearest evidence is that frontier models that look strong on familiar benchmarks collapse on tasks demanding genuine backtracking. Reasoning models like DeepSeek-R1 and o1-preview reach only 20–23.6% exact match on constraint-satisfaction problems, where fluency in producing reflective-sounding text does not translate into solving unfamiliar instance structures (see "Can reasoning models actually sustain long-chain reflection?"). A companion finding suggests this is a ceiling, not a scaling gap: across constrained-optimization tasks, LLMs plateau around 55–60% constraint satisfaction regardless of parameter count, architecture, or training regime (see "Do larger language models solve constrained optimization better?"). On this evidence, the levers that usually raise benchmark scores like MMLU, more parameters or a different training regime, do not move the long-horizon ceiling.
Why would knowledge and endurance diverge? One framing is structural: an LLM is an autoregressive probability machine, and many of its failures can be predicted from that fact alone. Tasks whose target sequences are low-probability (counting letters, reciting the alphabet backwards) stay hard even when they are logically trivial (see "Can we predict where language models will fail?"). MMLU questions sit in high-probability, well-trodden territory, whereas a long-horizon task strings together many steps, some of them low-probability, and the errors compound: the chance of finishing cleanly shrinks with every additional dependent step. There is also a formal limit on a model fixing its own drift mid-task. Self-improvement is bounded by a generation-verification gap, so a model cannot reliably catch and correct its own accumulating mistakes through metacognition alone; it needs something external to validate each step (see "What stops large language models from improving themselves?").
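The compounding argument can be made concrete with a toy calculation. The sketch below is an illustration, not a result from any of the cited notes: it assumes every step succeeds independently with the same probability, and it models the external check as a perfect verifier that allows a fixed number of retries per step. Both assumptions are simplifications, and the function names are hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming each step
    succeeds independently with probability p_step."""
    return p_step ** n_steps


def step_success_with_verifier(p_step: float, max_attempts: int) -> float:
    """Per-step success when an (assumed perfect) external verifier
    rejects bad steps and allows up to max_attempts independent tries."""
    return 1.0 - (1.0 - p_step) ** max_attempts


if __name__ == "__main__":
    p = 0.98  # assumed per-step reliability, chosen only for illustration
    for n in (1, 10, 50, 100):
        unverified = chain_success(p, n)
        verified = chain_success(step_success_with_verifier(p, 2), n)
        print(f"{n:>3} steps: unverified={unverified:.3f}  verified={verified:.3f}")
```

Under these assumptions, a step that is right 98% of the time yields a clean 50-step run only about 36% of the time, while a single verified retry per step keeps the whole chain close to its per-step reliability. Real tasks violate the independence assumption, so the numbers indicate the shape of the problem rather than a prediction.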
The most striking lateral point is that long-horizon competence may not live in the model at all. Nex-N1 found that autonomous-agent performance scales with the *environment* (its complexity, diversity, and real-world fidelity) rather than with model size; starve any one of those dimensions and generalization collapses, no matter how capable the base model is (see "What blocks scaling from language models to autonomous agents?"). In the same spirit, AgentFly reached 87.88% on GAIA validation purely through episodic memory operations, without modifying the LLM's parameters (see "Can agents learn continuously from experience without updating weights?"). If long-horizon performance can move this much while the model is held fixed, the model's MMLU score was not the bottleneck in those settings.
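To show what "improving through memory while the weights stay fixed" can look like in practice, here is a minimal sketch of a case memory that stores past task outcomes and retrieves the most similar ones for a new task. It is not AgentFly's implementation: the class and function names are hypothetical, and token-overlap (Jaccard) similarity stands in for whatever retrieval policy a real system would learn.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Case:
    task: str
    plan: str
    succeeded: bool


@dataclass
class CaseMemory:
    """Append-only store of past cases; the model itself is never updated."""
    cases: list[Case] = field(default_factory=list)

    def write(self, task: str, plan: str, succeeded: bool) -> None:
        self.cases.append(Case(task, plan, succeeded))

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        # Jaccard overlap of lowercase tokens: a crude stand-in for learned retrieval.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        if not ta or not tb:
            return 0.0
        return len(ta & tb) / len(ta | tb)

    def retrieve(self, task: str, k: int = 2) -> list[Case]:
        ranked = sorted(
            self.cases,
            key=lambda c: self._similarity(task, c.task),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(task: str, memory: CaseMemory, k: int = 2) -> str:
    """Prepend retrieved cases to the new task; this text would go to a frozen LLM."""
    lines = ["Relevant past cases:"]
    for case in memory.retrieve(task, k):
        outcome = "succeeded" if case.succeeded else "failed"
        lines.append(f"- task: {case.task} | plan: {case.plan} | outcome: {outcome}")
    lines.append(f"New task: {task}")
    return "\n".join(lines)
```

Every completed episode calls `write`, and every new episode calls `build_prompt`, so behavior changes over time even though no parameter is touched. The design choice worth noting is that all adaptation lives in what gets stored and what gets retrieved.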
There's a deployment-mechanism angle too. Non-reasoning models can't be made to match reasoning models just by spending more inference compute: the gap comes from a *trained reasoning protocol* that makes extra tokens productive, not from raw knowledge (see "Can non-reasoning models catch up with more compute?"). Workflow structure matters as well: LLMs forecast far better when the pipeline separates numerical from contextual reasoning, a strength that monolithic prompting hides (see "Can LLMs actually forecast time series better than we think?"). The thing you'd never guess from an MMLU leaderboard is that a chunk of what we call "weak at long-horizon tasks" is really weak memory, weak environment, or weak scaffolding around a model that already knows enough. That is why architectures like Titans, which offload long-range state into a separate neural memory rather than asking attention to carry it, are aimed at exactly this seam (see "Can neural memory modules scale language models beyond attention limits?").
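The idea of separating numerical from contextual reasoning can be sketched as a two-stage pipeline. This is an illustration of the decomposition only, not the method from the cited note: the numerical stage here is a simple trend extrapolation, and `keyword_stub` is a hypothetical placeholder for the contextual stage, which in a real pipeline would be a separate model call.

```python
from statistics import mean
from typing import Callable, Sequence


def numeric_forecast(history: Sequence[float], window: int = 3) -> float:
    """Numerical stage: last value plus the mean of the most recent
    step-to-step changes (a deliberately simple trend extrapolation)."""
    if len(history) < 2:
        raise ValueError("need at least two observations")
    diffs = [b - a for a, b in zip(history[:-1], history[1:])]
    return history[-1] + mean(diffs[-window:])


def keyword_stub(context: str, baseline: float) -> float:
    """Contextual stage placeholder (hypothetical): returns a multiplicative
    adjustment. A real pipeline would ask a model to reason about the context."""
    return 1.2 if "promotion" in context.lower() else 1.0


def forecast(
    history: Sequence[float],
    context: str,
    contextual_stage: Callable[[str, float], float] = keyword_stub,
    window: int = 3,
) -> float:
    """Run the two stages separately, then combine their outputs."""
    baseline = numeric_forecast(history, window)
    return baseline * contextual_stage(context, baseline)
```

Because the arithmetic never passes through free-form text generation, an error in the contextual stage cannot corrupt the numerical baseline, and each stage can be inspected or swapped on its own.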
Sources (9 notes)
DeepSeek-R1 and o1-preview achieve only 20–23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Nex-N1 shows that autonomous agent performance depends on environment scaling along complexity, diversity, and real-world fidelity — not model size. Deficits in any single dimension collapse generalization, but scaling all three together enables frontier performance.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.