FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Paper · arXiv 2507.13337 · Published July 17, 2025
LLM Evaluations and Benchmarks · Domain Specialization in LLMs

Abstract. Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human, or superhuman, expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps, involving topological and geometric insight, mathematical knowledge, combinatorial considerations, precise implementation, and more. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly expressive framework of Monadic Second-Order (MSO) logic on graphs, paving the way toward automatic problem generation at scale, which is ideal for building RL environments.

Introduction. Artificial Intelligence (AI) holds the promise of solving the world’s hardest scientific, algorithmic, and mathematical challenges—problems so complex they baffle even the brightest human minds. Current benchmarks, however, often do not paint a complete picture of AI’s depth of understanding. While recent achievements are remarkable, such as OpenAI-o3 attaining a 2,724 rating on CodeForces or securing a gold medal at the International Olympiad in Informatics [EWS+25], they nevertheless mask a sobering reality: the skills honed for these competitions do not capture the full spectrum of reasoning needed for large-scale, real-world research problems. Tasks such as optimising global supply chains, managing large-scale power grids, and designing resilient network infrastructures are orders of magnitude harder, requiring algorithmic insight that goes far beyond the scope of typical competitive programming. To this end, we introduce FormulaOne, a benchmark centred around dynamic programming over graphs—an algorithmic cornerstone of real-world optimisation.
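To make "dynamic programming over graphs" concrete, here is a minimal sketch in the simplest possible setting: a maximum-weight independent set on a tree. It is not taken from the benchmark. The function name, the input format, and the choice of problem are assumptions made purely for illustration, and FormulaOne's problems are described above as far more demanding than this.

```python
from collections import defaultdict


def max_weight_independent_set(n, edges, weight):
    """Illustrative only: best total weight of an independent set in a tree.

    Assumes vertices are labelled 0..n-1, n >= 1, and `edges` forms a
    connected tree. These conventions are hypothetical, not the benchmark's.
    """
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from vertex 0: records each vertex's parent and a
    # preorder in which every vertex appears after its parent.
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    order = []
    stack = [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                stack.append(v)

    # DP state per vertex u, over the subtree rooted at u:
    #   take[u] = best weight when u is in the set
    #   skip[u] = best weight when u is not in the set
    take = list(weight)
    skip = [0] * n

    # Reversed preorder visits children before their parents, so each
    # vertex's state is final by the time it is folded into its parent.
    for u in reversed(order):
        p = parent[u]
        if p != -1:
            take[p] += skip[u]                # parent taken: child must be skipped
            skip[p] += max(take[u], skip[u])  # parent skipped: child is free
    return max(take[0], skip[0])
```

For example, on the path 0-1-2-3 with weights [1, 4, 5, 4], the call `max_weight_independent_set(4, [(0, 1), (1, 2), (2, 3)], [1, 4, 5, 4])` returns 8, by choosing vertices 1 and 3. The two-value state per vertex is what keeps the computation linear in the size of the tree. Independence of a vertex set is also a property that can be written as an MSO formula on graphs.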

Discussion / Conclusion. While frontier AI models achieve high ratings in top-human-level competitive programming, they fail on harder algorithmic problems, such as the ‘hard’ problems in FormulaOne. This highlights that current benchmarks, which often rely on problems solvable by human experts, are insufficient for measuring the deep algorithmic and combinatorial reasoning required for complex, real-world research tasks. Given the depth of reasoning these problems demand, future progress may depend on incorporating more principled approaches, such as systematic search, rather than relying solely on the emergent capabilities of current models. Another contribution is our use of Monadic Second-Order (MSO) logic on graphs to take the first steps towards a principled method for generating a virtually unlimited suite of hard yet solvable problems.