Particle.news

WSU-Led Study Finds ChatGPT Inconsistent on Scientific True-or-False, With Only Modest Gains

Researchers warn that fluent answers can mask weak reasoning, urging verification plus training.

Overview

  • Testing 719 hypotheses from recent business research, the team posed each as a true-or-false prompt to ChatGPT ten times to gauge accuracy and stability.
  • In 2024 the free ChatGPT-3.5 scored about 76.5% correct, rising to roughly 80% with a free 2025 model, yet after adjusting for guessing, performance was only about 60% better than chance.
  • The system was poor at spotting falsehoods, correctly flagging false hypotheses only about 16.4% of the time.
  • Answers varied across identical prompts, with consistent responses appearing only about 73% of the time, including cases that split five true and five false across the ten runs.
  • Findings published in Rutgers Business Review, led by Washington State University’s Mesut Cicek with co-authors from Southern Illinois, Rutgers, and Northeastern, recommend skepticism, verification, and targeted user training.
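The repeated-prompt protocol described above (ten true/false queries per hypothesis, scored for accuracy and stability) can be sketched as follows. This is an illustrative outline only, with made-up data; the function and metrics are assumptions, not the authors' code or results.

```python
def evaluate_runs(runs_per_hypothesis, ground_truth):
    """Score repeated true/false answers for each hypothesis.

    runs_per_hypothesis: one list of bool answers per hypothesis
    ground_truth: the correct label for each hypothesis
    Returns (accuracy, consistency): per-answer accuracy across all runs,
    and the fraction of hypotheses whose answers all agree.
    """
    total = correct = consistent = 0
    for answers, truth in zip(runs_per_hypothesis, ground_truth):
        correct += sum(a == truth for a in answers)
        total += len(answers)
        if len(set(answers)) == 1:  # every run gave the same answer
            consistent += 1
    return correct / total, consistent / len(ground_truth)

# Illustrative data: 3 hypotheses, 10 runs each (not the study's data)
runs = [
    [True] * 10,               # fully consistent and correct
    [True] * 5 + [False] * 5,  # the five-true/five-false split noted above
    [False] * 9 + [True],      # mostly consistent
]
truth = [True, True, False]
acc, cons = evaluate_runs(runs, truth)
print(acc, cons)  # -> 0.8 and ~0.33 on this toy data
```

On this toy input, per-answer accuracy is 0.8 and only one of three hypotheses is fully consistent, mirroring how a model can look accurate overall while still flip-flopping on identical prompts.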