Overview
- To test 719 hypotheses drawn from recent business research, the team posed each one to ChatGPT as a true-or-false prompt ten times, gauging both accuracy and the stability of its answers.
- In 2024 the free ChatGPT-3.5 answered about 76.5% of prompts correctly, rising to roughly 80% with a free 2025 model; yet after adjusting for chance, performance was only about 60% better than random guessing.
- The system was poor at spotting falsehoods, correctly flagging false hypotheses only about 16.4% of the time.
- Answers varied across identical prompts: responses were consistent only about 73% of the time, and in some cases the ten runs split five true and five false.
- The findings, published in Rutgers Business Review and led by Washington State University’s Mesut Cicek with co-authors from Southern Illinois, Rutgers, and Northeastern, recommend skepticism, independent verification, and targeted user training.
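The two metrics above can be sketched in code. This is a hedged illustration, not the study's actual scoring script: it assumes "consistent" means all ten answers to a hypothesis agree, and uses a simple kappa-style rescaling against a 50% coin-flip baseline for the chance adjustment; the helper names are invented for this sketch.

```python
def consistency_rate(runs):
    """Fraction of hypotheses whose repeated answers all agree.

    `runs` is a list of lists; each inner list holds the ten
    true/false answers ChatGPT gave for one hypothesis.
    (Assumes 'consistent' means unanimous across runs.)
    """
    return sum(len(set(answers)) == 1 for answers in runs) / len(runs)

def chance_adjusted(accuracy, chance=0.5):
    """Rescale raw accuracy against a chance baseline.

    With two options (true/false), blind guessing scores 50%,
    so raw accuracy is rescaled to the 0..1 range above chance.
    """
    return (accuracy - chance) / (1 - chance)

# Toy data: three hypotheses, ten answers each.
runs = [
    [True] * 10,               # fully consistent
    [True] * 5 + [False] * 5,  # the five/five split described above
    [False] * 10,              # fully consistent
]
print(consistency_rate(runs))   # 2 of 3 hypotheses are unanimous
print(chance_adjusted(0.765))   # 76.5% raw accuracy, rescaled
```

Under these assumptions, a 76.5% raw score rescales to 0.53 above chance, which is one way a seemingly high percentage can look far less impressive once the coin-flip baseline is taken into account.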