Overview
- A First Proof team convened at Harvard in early June to blind-grade AI-generated solutions to ten original, unpublished research problems and found passing answers for seven of them.
- Organizers used unpublished questions to prevent training-data leakage so the test measured genuine problem-solving rather than memorized solutions.
- Submitted solutions varied: some were flawless, some needed small human fixes, and some were incorrect, showing uneven reliability in model outputs.
- Judges said later success often relied on multiple attempts, advanced prompting, or surrounding software tools called "AI harnesses," which helped models extend or check steps.
- The results add to recent high-profile claims, such as OpenAI’s reported disproof of an Erdős conjecture, and have prompted calls for full reasoning disclosure, independent peer review, and wider use of formal proof assistants such as Lean.