Overview
- The paper was posted to arXiv on March 9, 2026, with the dataset and evaluation code released on GitHub.
- LIT-RAGBench defines five capability categories (Logic, Integration, Table, Reasoning, and Abstention) to evaluate generator behavior under unified conditions.
- The dataset comprises 114 human-written Japanese questions with curated English translations; the questions are built around fictional scenarios so that answers must be grounded in the provided documents.
- An LLM-as-a-Judge method scores responses, reporting category-wise and overall accuracies (a scoring sketch follows this list); in initial tests, no API-based or open-weight model exceeds 90% overall accuracy.
- The authors position the benchmark as a complement to retrieval-focused evaluations to guide model selection and RAG development, noting limitations such as the small, language-specific dataset and reliance on model-based judging.
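As a rough illustration of the scoring step mentioned above, the sketch below aggregates binary judge verdicts into per-category and overall accuracy. This is a minimal sketch under assumed conventions: the `judgments` structure, the `score` function name, and the demo verdicts are hypothetical and not taken from the paper's released code.

```python
from collections import defaultdict

def score(judgments):
    """Aggregate binary judge verdicts into per-category and overall accuracy.

    `judgments` is a list of (category, is_correct) pairs, an assumed
    format for illustration, not the benchmark's actual data layout.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for category, is_correct in judgments:
        totals[category] += 1
        correct[category] += int(is_correct)
    # Accuracy per capability category, plus a single pooled overall score.
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_category, overall

# Tiny demo with made-up verdicts: two of three responses judged correct.
demo = [("Logic", True), ("Logic", False), ("Abstention", True)]
per_category, overall = score(demo)
print(per_category)       # {'Logic': 0.5, 'Abstention': 1.0}
print(round(overall, 3))  # 0.667
```

Note that the overall score here is pooled over all items rather than macro-averaged over categories; which convention the paper uses would need to be checked against its evaluation code.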