ArXiv Preprint Says Prompts Drive Many LVLM Hallucinations

The authors present tools to test and reduce instruction-induced mistakes in image-text models.

Overview

A research preprint reports that many hallucinations in large vision-language models (LVLMs), image-text systems that answer questions about pictures, stem from the wording of the instructions rather than from the visual input.
The team introduces HalluScope, a benchmark that probes how prompts, instructions, and other textual priors can push a model to invent details not found in the image.
To counter these errors, the authors propose HalluVL-DPO, a preference-optimization fine‑tuning method that trains models to choose responses grounded in the picture over tempting but unfounded text-based guesses.
In evaluations described by the authors, the tuned models reduced the targeted prompt-induced hallucinations while matching or improving scores on other grounding and visual capability tests.
The researchers say they will release the benchmark, the curated preference dataset, and the code for community review. They note that the findings are preliminary because the paper is an unreviewed arXiv preprint.
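
To make the preference-optimization idea behind HalluVL-DPO concrete, here is a minimal sketch of the standard Direct Preference Optimization (DPO) loss for a single preference pair. It assumes that the "chosen" response is the image-grounded answer and the "rejected" response is the prompt-induced guess. The summary above does not give the paper's exact objective, so the function name, the argument names, and the beta value are illustrative assumptions, not the authors' implementation.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch only).

    Each argument is the summed log-probability of a full response:
    "chosen" is the image-grounded answer and "rejected" is the ungrounded
    guess (assumed roles). beta scales the preference margin; 0.1 is an
    assumed value, not one taken from the preprint.
    """
    # How far the tuned policy has moved from the frozen reference model
    # on each response.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    z = -margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# The loss falls when the policy raises the grounded answer's likelihood
# relative to the reference and lowers that of the ungrounded guess.
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)
grounded = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(neutral > grounded)
```

In practice the log-probabilities would come from scoring each response with the tuned model and a frozen reference copy, both conditioned on the same image and prompt, and the loss would be averaged over a batch of preference pairs.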