Overview
- Developers behind the benchmark say top models could approach full marks within months to about a year, framing it as a goal rather than a guarantee.
- Performance has climbed fast: OpenAI’s GPT‑4o scored under 3% in late 2024, while Google’s Gemini rose from 18.8% to 45.9% last month, with Anthropic’s Claude at 34.2%.
- Humanity’s Last Exam is a 2,500‑question, closed‑answer test spanning about 100 fields, and it aims to measure both broad knowledge and deep reasoning.
- Organisers collected roughly 70,000 proposed questions from experts in about 50 countries and paid out a $500,000 prize pool. They narrowed the submissions to about 13,000, then settled on a final 2,500, keeping many items secret to block ‘benchmark hacking’.
- The test’s authors call it the final closed‑ended academic benchmark. Contributors note that even perfect scores would not replace human skills such as surgical practice, creativity, and judgement, which pushes future evaluations toward problems with no known answers.