GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks



We introduce GDPval, a benchmark evaluating AI model capabilities on real-world economically valuable tasks. GDPval covers the majority of U.S. Bureau of Labor Statistics Work Activities for 44 occupations across the top 9 sectors contributing to U.S. GDP (Gross Domestic Product). Tasks are constructed from the representative work of industry professionals with an average of 14 years of experience. We find that frontier model performance on GDPval is improving roughly linearly over time, and that the current best frontier models are approaching industry experts in deliverable quality. We analyze the potential for frontier models, when paired with human oversight, to perform GDPval tasks cheaper and faster than unaided experts. We also demonstrate that increased reasoning effort, increased task context, and increased scaffolding improve model performance on GDPval. Finally, we open-source a gold subset of 220 tasks and provide a public automated grading service at evals.openai.com to facilitate future research in understanding real-world model capabilities.

Introduction. There is growing debate about how increasingly capable AI models could affect the labor market, whether by automating specific tasks, replacing entire occupations, or creating entirely new kinds of work (Brynjolfsson et al., 2025; Chen et al., 2025). Current approaches to measuring the economic impact of AI focus on indicators such as adoption rates, usage patterns, and GDP growth attributed to AI (Chatterji et al., 2025; Tamkin et al., 2024; Appel et al., 2025; Acemoglu, 2025; Bick et al., 2024). However, historical evidence from technological shifts such as electricity, airplanes, and computers shows that the transition from invention to economy-wide permeation often takes years or even decades, requiring regulatory, cultural, and procedural changes (David, 1990; Brynjolfsson & Hitt, 2000; Brynjolfsson et al., 2019; Dwivedi et al., 2021; Solow, 1987). Therefore, while informative when available, these methods are lagging indicators of AI impacts. We consider an alternate method for understanding the potential economic impacts of AI: directly measuring AI model capabilities.

Discussion / Conclusion. We hope this work contributes to the science of tracking model progress, so that we have better data to assess the social impacts of AI models.

Lines of inquiry this paper opens

Research framings built by reading the notes related to this paper — the questions it feeds into.

How does AI adoption affect human skill development and labor equality?

Can single-axis benchmarks accurately predict agent deployment success?

Why do benchmark improvements fail to reflect actual reasoning quality?

How do we evaluate AI systems when user perception misleads actual performance?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

Why do static benchmarks miss frontier capabilities that open-world tasks reveal?

Does domain specialization cause models to lose capabilities elsewhere?

How can identical external performance mask different internal representations?

Why do benchmarks become saturated so quickly after initial launch?

How does objective evolution guide discovery better than fixed planning?

What makes evolving the benchmark different from evolving the optimizer itself?

How do professional roles and expertise transform with AI-generated content?

What role shifts occur when experts become custodians of AI knowledge?
