Overview
- The study, published in Nature, evaluated Merlin across six categories spanning more than 750 diagnostic, prognostic, and quality tasks.
- The model was pretrained on a linked Stanford corpus of more than 15,000 3D abdominal CT scans paired with radiology reports and nearly one million diagnosis codes.
- In external testing on more than 50,000 unseen abdominal CTs from four hospitals, Merlin delivered stronger off-the-shelf performance than other vision–language models and matched or beat task-specific systems.
- On diagnosis-code prediction, Merlin reached roughly 81% average accuracy across 692 codes, rising to 90% on a 102-code subset, and it ranked five-year disease risk with 75% accuracy, versus 68% for a comparator model.
- Despite being trained only on abdominal scans, the model generalized to chest CT interpretation, and the team encourages local fine-tuning as NIH- and MIDRC-backed work advances toward approvals for simpler tasks.
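
The "81% average accuracy across 692 codes" figure implies a per-code accuracy averaged over all codes. A minimal sketch of how such a macro-averaged metric could be computed for multi-label diagnosis-code prediction — the function name and toy data below are illustrative assumptions, not the study's actual evaluation code:

```python
# Hypothetical sketch: macro-averaged per-code accuracy for multi-label
# diagnosis-code prediction. Data shapes and names are made up for illustration.

def per_code_accuracy(y_true, y_pred):
    """Accuracy for each diagnosis code, averaged across all codes.

    y_true, y_pred: lists of equal-length binary tuples — one tuple per scan,
    one position per diagnosis code (1 = code present, 0 = absent).
    """
    n_scans = len(y_true)
    n_codes = len(y_true[0])
    accuracies = []
    for c in range(n_codes):
        correct = sum(1 for t, p in zip(y_true, y_pred) if t[c] == p[c])
        accuracies.append(correct / n_scans)
    return sum(accuracies) / n_codes

# Toy example: 4 scans, 3 diagnosis codes.
truth = [(1, 0, 1), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
preds = [(1, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(per_code_accuracy(truth, preds))  # → 0.8333...
```

Macro-averaging weights every code equally, so rare codes count as much as common ones; a micro-averaged variant would instead pool all scan-code decisions before dividing.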
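
The 75%-versus-68% risk-ranking comparison suggests a pairwise ranking metric: how often a patient who develops disease within five years is scored as higher risk than one who does not. Assuming a concordance-style measure (the study's exact metric is not specified here), a self-contained sketch:

```python
# Hypothetical sketch: pairwise concordance for five-year risk ranking.
# Counts the fraction of (event, non-event) pairs where the event patient
# received the higher risk score; ties count half. Equivalent to AUROC for
# binary outcomes. The metric choice is an assumption for illustration.

def pairwise_concordance(risk_scores, outcomes):
    """Fraction of event/non-event pairs ranked correctly by risk score."""
    pairs = concordant = 0.0
    for i, (s_i, o_i) in enumerate(zip(risk_scores, outcomes)):
        for s_j, o_j in zip(risk_scores[i + 1:], outcomes[i + 1:]):
            if o_i == o_j:
                continue  # only compare an event with a non-event
            pairs += 1
            event_score, other_score = (s_i, s_j) if o_i == 1 else (s_j, s_i)
            if event_score > other_score:
                concordant += 1
            elif event_score == other_score:
                concordant += 0.5
    return concordant / pairs

# Toy example: 1 = disease within five years, 0 = disease-free.
scores = [0.9, 0.2, 0.6, 0.4]
events = [1, 0, 1, 0]
print(pairwise_concordance(scores, events))  # → 1.0 (every pair ranked correctly)
```

Under this reading, "75% versus 68%" would mean Merlin correctly ordered three in four such patient pairs, against roughly two in three for the comparator.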