Overview
- LIT-RAGBench debuts with five evaluation categories—Integration, Reasoning, Logic, Table, Abstention—to measure generator behavior under grounded retrieval.
- The dataset includes 114 human-authored Japanese questions with curated English translations, uses fictional entities to avoid contamination, and applies LLM-as-a-judge scoring.
- Across both API-based and open-weight models, no system surpassed 90% overall accuracy; the benchmark provides category-level breakdowns to guide model selection and development.
- An accessible DEV guide outlines the standard RAG pipeline of chunking, embeddings, vector search, and context-conditioned generation, emphasizing benefits like reduced hallucination and access to fresh or private data.
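The pipeline the guide describes can be sketched in a few lines. The embedding below is a hashed bag-of-words stand-in for a real encoder model, and the chunk size and similarity search are illustrative choices, not the guide's actual parameters:

```python
# Minimal RAG pipeline sketch: chunking, toy embeddings, vector search,
# and context-conditioned prompt assembly.
import math
from collections import Counter

def chunk(text, size=40):
    """Split text into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text, dim=64):
    """Toy embedding: hashed bag-of-words vector (stand-in for a real model)."""
    vec = [0.0] * dim
    for tok, n in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += n
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def search(query, index, k=2):
    """Cosine-similarity search over (chunk, vector) pairs."""
    q = embed(query)
    scored = sorted(index, key=lambda cv: -sum(a * b for a, b in zip(q, cv[1])))
    return [c for c, _ in scored[:k]]

doc = ("Tokyo is the capital of Japan. Kyoto was the former capital. "
       "Osaka is known for its food culture.")
index = [(c, embed(c)) for c in chunk(doc)]          # offline: index the corpus
context = search("What is the capital of Japan?", index)  # online: retrieve
prompt = ("Answer using only this context:\n" + "\n".join(context)
          + "\nQ: What is the capital of Japan?")    # generate from the prompt
```

In production the toy embedding is replaced by a learned embedding model and the linear scan by an approximate-nearest-neighbor index, but the overall flow is the same.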
- Separately, KohakuRAG introduces hierarchical indexing with query planning, cross-query reranking, and ensemble voting with abstention-aware filtering, achieving first place on the WattBot 2025 Challenge with a 0.861 score and releasing code openly.
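KohakuRAG's exact voting rule is not detailed here, but ensemble voting with abstention-aware filtering can be sketched as follows, assuming each pipeline run emits either an answer string or an abstention token (the `ABSTAIN` sentinel and `min_support` threshold are illustrative):

```python
from collections import Counter

ABSTAIN = "IDK"  # sentinel an individual run emits when it declines to answer

def vote(candidates, min_support=2):
    """Majority vote over ensemble answers.

    Abstentions are filtered out before counting, and the ensemble as a
    whole abstains if no remaining answer reaches min_support votes.
    """
    answered = [c for c in candidates if c != ABSTAIN]
    if not answered:
        return ABSTAIN
    answer, count = Counter(answered).most_common(1)[0]
    return answer if count >= min_support else ABSTAIN
```

Under this rule, a clear majority wins, while a split or mostly-abstaining ensemble yields an abstention rather than a low-confidence guess.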