Overview
- The paper was posted to arXiv on March 9, 2026, with the dataset and evaluation code released on GitHub.
- LIT-RAGBench defines five capability categories (Logic, Integration, Table, Reasoning, and Abstention) to evaluate generator behavior under unified conditions.
- The dataset comprises 114 human-written Japanese questions with curated English translations; the questions are built around fictional scenarios so that answers must be grounded in the provided documents.
- An LLM-as-a-Judge method scores responses, reporting category-wise and overall accuracies (a scoring sketch follows this list); in initial tests, no API-based or open-weight model exceeds 90% overall accuracy.
- The authors position the benchmark as a complement to retrieval-focused evaluations to guide model selection and RAG development, noting limitations such as the small, language-specific dataset and reliance on model-based judging.
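As a rough illustration of the scoring step mentioned above, the sketch below aggregates binary judge verdicts into per-category and overall accuracy. This is a minimal sketch under assumed conventions: the `judgments` structure, the `score` function name, and the demo verdicts are hypothetical and not taken from the paper's released code.

```python
from collections import defaultdict

def score(judgments):
    """Aggregate binary judge verdicts into per-category and overall accuracy.

    `judgments` is a list of (category, is_correct) pairs, an assumed
    format for illustration, not the benchmark's actual data layout.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for category, is_correct in judgments:
        totals[category] += 1
        correct[category] += int(is_correct)
    # Accuracy per capability category, plus a single pooled overall score.
    per_category = {c: correct[c] / totals[c] for c in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return per_category, overall

# Tiny demo with made-up verdicts: two of three responses judged correct.
demo = [("Logic", True), ("Logic", False), ("Abstention", True)]
per_category, overall = score(demo)
print(per_category)       # {'Logic': 0.5, 'Abstention': 1.0}
print(round(overall, 3))  # 0.667
```

Note that the overall score here is pooled over all items rather than macro-averaged over categories; which convention the paper uses would need to be checked against its evaluation code.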