OpenAI shows where artificial intelligence is already outperforming human experts
What is GDPval?
GDPval draws on the nine industries that contribute the most to U.S. GDP, including healthcare, finance, manufacturing, and public administration. Within these industries, the test covers 44 occupations, from programmers to nurses to journalists. The first version, GDPval-v0, has experienced professionals compare reports produced by AI models with reports written by human experts and choose the better one.
Testing results
- GPT-5-high (a configuration of GPT-5 that uses more computation) was rated better than or on par with industry experts in 40.6% of cases.
- Claude Opus 4.1 (Anthropic) was rated better than or on par with experts in 49% of cases. OpenAI attributes this partly to the model's ability to create appealing graphics rather than to the substance of its content.
- For comparison, GPT-4o, released about 15 months earlier, scored only 13.7%.
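The percentages above are "better or equal" shares: the fraction of expert comparisons in which the AI deliverable won or tied. As a rough illustration, here is a minimal sketch of how such a figure could be tallied. It assumes each comparison yields one of three hypothetical verdict labels ("win", "tie", "loss") and uses made-up sample data; it is not OpenAI's grading code, and the real aggregation across tasks and graders may differ.

```python
from collections import Counter

# Hypothetical verdicts, one per AI-vs-expert comparison (sample data, not real results):
# "win"  = AI deliverable preferred, "tie" = judged equal, "loss" = expert deliverable preferred.
verdicts = ["win", "loss", "tie", "loss", "win", "loss", "loss", "tie", "loss", "win"]

ALLOWED = {"win", "tie", "loss"}


def win_or_tie_rate(verdicts):
    """Share of comparisons where the AI output was rated better than or equal to the expert's."""
    if not verdicts:
        raise ValueError("no comparisons to score")
    counts = Counter(verdicts)
    unknown = set(counts) - ALLOWED
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    # Ties count as successes, matching the "better or equal" framing.
    return (counts["win"] + counts["tie"]) / len(verdicts)


rate = win_or_tie_rate(verdicts)
print(f"win-or-tie rate: {rate:.1%}")  # 3 wins + 2 ties out of 10 -> 50.0%
```

Counting ties as successes is what "better or equal" implies; a stricter "wins only" metric would exclude them and produce lower numbers.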
Testing limitations
OpenAI acknowledges that the current version of GDPval covers only a limited set of tasks, primarily writing research reports, while most professions involve much more than report writing. The company therefore plans to make future versions more robust, covering more industries and interactive workflows.
Importance for the future of work
Despite the limitations, progress is evident. Dr. Aaron Chatterji, chief economist at OpenAI, believes that professionals can now offload some tasks to AI models and spend more time on higher-value work. Tejal Patwardhan of OpenAI adds that the progress over the past 15 months is encouraging and expects capabilities to keep improving.
Silicon Valley already relies on a series of benchmarks (e.g., AIME 2025 for math problems and GPQA Diamond for PhD-level science questions), but many models already score near the ceiling on these tests. GDPval could therefore become an important tool for measuring the practical economic value of AI. For now, OpenAI will need to build larger, more comprehensive versions of the test before it can confidently claim that AI truly outperforms human experts.