Particle.news

New LIT-RAGBench Tests RAG Generators, Finds No Model Above 90% Accuracy

The arXiv release introduces a reproducible, generator-focused suite measuring logic, integration, table use, reasoning, and abstention on a grounded Japanese dataset.

Overview

  • The paper was posted to arXiv on March 9, 2026, with the dataset and evaluation code released on GitHub.
  • LIT-RAGBench defines five capability categories—Logic, Integration, Table, Reasoning, and Abstention—to evaluate generator behavior under unified conditions.
  • The dataset comprises 114 human-written Japanese questions with curated English translations, using fictional scenarios to force answers to be grounded in provided documents.
  • An LLM-as-a-Judge method scores responses with category-wise and overall accuracies, and initial tests show no API-based or open-weight model exceeds 90% overall.
  • The authors position the benchmark as a complement to retrieval-focused evaluations to guide model selection and RAG development, noting limitations such as the small, language-specific dataset and reliance on model-based judging.
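The category-wise and overall scoring described above can be sketched as a simple aggregation over judged responses. This is an illustrative sketch only: the function names, data shapes, and judge verdicts here are assumptions, not the benchmark's actual implementation.

```python
# Illustrative sketch of category-wise and overall accuracy aggregation,
# assuming each judged item carries a category label and a binary verdict
# from an LLM judge. Names and data shapes are hypothetical.
from collections import defaultdict

# The five capability categories defined by LIT-RAGBench.
CATEGORIES = ["Logic", "Integration", "Table", "Reasoning", "Abstention"]

def aggregate(judgments):
    """judgments: list of (category, correct) pairs, where `correct`
    is a boolean verdict produced by the LLM-as-a-Judge step."""
    per_cat = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for category, correct in judgments:
        per_cat[category][0] += int(correct)
        per_cat[category][1] += 1
    # Per-category accuracy.
    cat_acc = {c: hits / total for c, (hits, total) in per_cat.items()}
    # Overall accuracy across all judged items.
    overall = (sum(hits for hits, _ in per_cat.values())
               / sum(total for _, total in per_cat.values()))
    return cat_acc, overall

cat_acc, overall = aggregate(
    [("Logic", True), ("Table", False), ("Abstention", True)]
)
```

A model would clear the 90% bar reported in the paper only if `overall` exceeded 0.9 under this kind of aggregation.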