Most AI retrieval systems don't publish benchmark results. They show demos, cherry-picked examples, and customer quotes. We think that's a problem.
If you're asking enterprises to trust your system with their knowledge, you should be willing to prove it works on standardized, reproducible tests. Not on your own curated dataset. On the same benchmarks the research community uses to evaluate state-of-the-art systems.
We ran Vrin on two of the hardest multi-hop reasoning benchmarks in the literature. Here are the results, the methodology, and what we learned.
The Benchmarks
MultiHop-RAG
MultiHop-RAG tests whether a system can answer questions that require connecting information across multiple documents. It was designed specifically to expose the limitations of single-document retrieval.
The corpus contains 609 news articles. The questions require reasoning across 2-4 documents to arrive at the correct answer. This is not a keyword-matching exercise. The system needs to find the right documents, connect the relevant facts, and synthesize an answer.
MuSiQue
MuSiQue (Multi-hop Questions via Single-hop Question Composition) is an academic benchmark designed to test genuine multi-hop reasoning. Each question is constructed by composing single-hop questions, so the evaluation can verify whether the system actually performed each reasoning step or simply guessed from surface patterns.
The dataset contains 2,417 questions with associated supporting paragraphs. It's considered one of the most rigorous tests of multi-hop reasoning ability in the NLP community.
Results
MultiHop-RAG: 95.1% accuracy
| System | Accuracy | Notes |
|---|---|---|
| Vrin | 95.1% | Full 609-article corpus, realistic retrieval conditions |
| GPT-5.2 (oracle evidence) | 78.9% | Same documents provided directly to the model |
| Improvement | +16.2pp | Percentage point improvement over the oracle baseline |
Evaluation details:
- 384 stratified samples (seed=42)
- 95% confidence interval: [90.5%, 99.7%]
- Vrin retrieves from the full 609-article corpus under realistic conditions (no oracle evidence provided)
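For readers who want to replicate the sampling step, a fixed-seed stratified sample like the one above can be drawn in a few lines of Python. The strata key and pool sizes below are hypothetical stand-ins for the benchmark's own question metadata, not the actual MultiHop-RAG schema:

```python
import random

def stratified_sample(questions, strata_key, n_total, seed=42):
    """Draw a fixed-seed sample with each stratum represented proportionally."""
    rng = random.Random(seed)
    strata = {}
    for q in questions:
        strata.setdefault(q[strata_key], []).append(q)
    sample = []
    for name, pool in sorted(strata.items()):  # sorted for determinism
        k = round(n_total * len(pool) / len(questions))
        sample.extend(rng.sample(pool, min(k, len(pool))))
    return sample

# Hypothetical question pool: three strata of unequal size (300 + 200 + 109).
questions = (
    [{"qtype": "inference", "id": i} for i in range(300)]
    + [{"qtype": "comparison", "id": i} for i in range(200)]
    + [{"qtype": "temporal", "id": i} for i in range(109)]
)
sample = stratified_sample(questions, "qtype", 384)
print(len(sample))  # 384
```

Fixing the seed in the `Random` instance (rather than the global RNG) keeps the draw reproducible even if other code consumes random numbers.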
The baseline is notable: GPT-5.2 was given the same source documents directly in its context window (oracle evidence). Even with perfect document access, the model scored 78.9%. Vrin, retrieving from the full corpus without knowing which documents are relevant, scored 95.1%.
This is not a retrieval advantage. This is a reasoning advantage. Vrin's knowledge graph captures entity relationships and temporal facts that allow it to connect information across documents in ways that raw text similarity cannot.
MuSiQue: 28% better than academic SOTA
| System | Exact Match | F1 Score | Notes |
|---|---|---|---|
| Vrin | 0.478 | 0.563 | 300 questions, full pipeline |
| HippoRAG 2 (academic SOTA) | 0.372 | 0.486 | Published state-of-the-art |
| Improvement | +28.5% | +15.8% | Relative improvement |
Evaluation details:
- 300 multi-hop questions (seed=42, answerable subset from validation split)
- 4,848 paragraphs ingested (40,749 facts extracted, 39,412 stored)
- Average query latency: 58 seconds
- Average retrieval per query: 41 graph facts + 24 document chunks
By question complexity:
| Complexity | Exact Match | F1 | Count |
|---|---|---|---|
| Complex | 0.370 | 0.479 | 146 |
| Moderate | 0.394 | 0.475 | 104 |
| Simple | 0.360 | 0.442 | 50 |
The comparison point is HippoRAG 2, the current academic state-of-the-art for knowledge-graph-augmented retrieval. HippoRAG 2 uses Personalized PageRank over a knowledge graph, similar to one component of Vrin's pipeline. Vrin's improvement comes from the combination of multi-hop beam search, iterative reasoning (query decomposition), confidence-driven retrieval, and knowledge consolidation working together.
How Vrin Achieves These Results
The difference is not a single technique. It is the integration of several components that each address a different failure mode of traditional retrieval:
1. Structured knowledge graph (not just vectors)
When Vrin ingests a document, it doesn't just create text chunks and embeddings. It extracts structured facts as subject-predicate-object triples with timestamps and confidence scores. These facts are stored in a knowledge graph where entities are connected by typed relationships.
This means when a question asks "How did revenue change after the CEO transition?", Vrin doesn't search for text that contains those words. It traverses from the CEO entity to the transition event to the revenue facts, following actual relationships.
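As a minimal sketch of that idea (the field names and API here are illustrative, not Vrin's actual schema):

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass(frozen=True)
class Fact:
    """A subject-predicate-object triple with a timestamp and confidence."""
    subject: str
    predicate: str
    obj: str
    timestamp: str     # ISO date the fact refers to
    confidence: float

class KnowledgeGraph:
    def __init__(self):
        self.by_subject = defaultdict(list)

    def add(self, fact):
        self.by_subject[fact.subject].append(fact)

    def neighbors(self, entity, min_conf=0.5):
        """Typed edges out of an entity, filtered by confidence."""
        return [f for f in self.by_subject[entity] if f.confidence >= min_conf]

kg = KnowledgeGraph()
kg.add(Fact("Acme", "appointed_ceo", "J. Doe", "2023-04-01", 0.95))
kg.add(Fact("J. Doe", "announced", "restructuring", "2023-06-10", 0.90))
kg.add(Fact("Acme", "reported_revenue", "$1.2B", "2023-12-31", 0.85))

# Answering "revenue after the CEO transition" becomes graph traversal
# from the entity, not keyword search over raw text.
for f in kg.neighbors("Acme"):
    print(f.predicate, f.obj)
```

The point is structural: once facts carry typed predicates and timestamps, "what happened after X" is an edge-following query rather than a similarity search.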
2. Multi-hop beam search
Vrin's graph retrieval doesn't stop at directly connected facts. It performs multi-hop beam search, following entity relationships across 2-3 hops with confidence-scored pruning at each step. Hub-weighted Personalized PageRank identifies the most important entities in the subgraph, ensuring that highly connected concepts get appropriate weight.
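The beam-search component can be sketched as follows. This is a simplified toy, assuming a plain adjacency map with per-edge confidences; it omits the hub-weighted PageRank weighting:

```python
import heapq

def beam_search(graph, start_entities, beam_width=5, max_hops=3, min_conf=0.4):
    """Multi-hop expansion with confidence-scored pruning at each step.

    `graph` maps entity -> list of (neighbor, edge_confidence) pairs.
    A path's score is the product of edge confidences along it.
    """
    beam = [(1.0, (e,)) for e in start_entities]  # (score, path)
    results = []
    for _ in range(max_hops):
        candidates = []
        for score, path in beam:
            for neighbor, conf in graph.get(path[-1], []):
                if neighbor in path:
                    continue  # avoid cycles
                new_score = score * conf
                if new_score >= min_conf:  # prune low-confidence paths
                    candidates.append((new_score, path + (neighbor,)))
        beam = heapq.nlargest(beam_width, candidates)  # keep top-k per hop
        results.extend(beam)
    return sorted(results, reverse=True)

graph = {
    "CEO": [("transition_event", 0.9)],
    "transition_event": [("revenue_2023", 0.8), ("press_release", 0.5)],
}
paths = beam_search(graph, ["CEO"])
for score, path in paths:
    print(round(score, 2), " -> ".join(path))
```

Multiplying edge confidences makes longer paths pay for every weak link, which is what lets the pruning threshold cut off speculative chains early.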
3. Iterative reasoning (query decomposition)
For complex questions, Vrin decomposes the query into atomic sub-questions, executes targeted retrieval for each gap, and injects structured chain-of-thought reasoning into the retrieval loop. Confidence-driven termination ensures the system stops when it has enough evidence, not after a fixed number of steps.
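The control flow of that loop can be sketched like this. The decomposer, retriever, and confidence scorer are injected callables standing in for the LLM- and index-backed components, so the toy is runnable without any of them:

```python
def iterative_answer(question, decompose, retrieve, confidence, threshold=0.8):
    """Decompose, retrieve per sub-question, stop when confidence clears the bar."""
    evidence = []
    for sub_q in decompose(question):
        evidence.extend(retrieve(sub_q))
        # Confidence-driven termination: stop when evidence suffices,
        # not after a fixed number of steps.
        if confidence(question, evidence) >= threshold:
            break
    return evidence

# Toy stand-ins: each sub-question yields one fact; confidence grows with evidence.
subs = ["Who became CEO?", "When did the transition happen?", "What was revenue after?"]
ev = iterative_answer(
    "How did revenue change after the CEO transition?",
    decompose=lambda q: subs,
    retrieve=lambda sq: [f"fact for: {sq}"],
    confidence=lambda q, ev: len(ev) / 3,
)
print(len(ev))  # 3
```

With a stronger confidence function the loop would exit after one or two sub-questions, which is exactly the latency/accuracy trade the section describes.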
4. Dual retrieval with intelligent fusion
Graph facts and vector chunks are retrieved in parallel and fused using rank fusion. A multi-dimensional confidence scorer evaluates entity coverage, topical relevance, fact density, and source corroboration before any context reaches the language model. Low-confidence retrievals are caught early.
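The text says "rank fusion" without naming a scheme; reciprocal rank fusion (RRF) is one standard instantiation, sketched here with hypothetical result IDs:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse parallel ranked result lists; k=60 is the common RRF default."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical parallel retrievals: graph facts and vector chunks.
graph_facts = ["f_revenue", "f_ceo", "f_merger"]
vector_chunks = ["c_earnings", "f_revenue", "c_press"]
fused = reciprocal_rank_fusion([graph_facts, vector_chunks])
print(fused[0])  # f_revenue
```

An item that both retrievers rank (here `f_revenue`) accumulates score from each list, so cross-retriever agreement floats to the top without needing the two score scales to be comparable.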
5. Knowledge consolidation
Vrin's knowledge graph isn't static. A consolidation pipeline runs periodically to deduplicate facts, detect contradictions, identify communities of related entities, and strengthen facts that consistently lead to good answers. Over time, the graph gets cleaner and more structured.
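The deduplication stage of such a pipeline can be sketched as below. The string-ratio similarity is a cheap stand-in for whatever embedding-based matching a real system would use, and the threshold is an assumption:

```python
from difflib import SequenceMatcher

def deduplicate_facts(facts, threshold=0.85):
    """Merge near-duplicate triples, keeping the higher-confidence copy.

    Facts are (subject, predicate, object, confidence) tuples.
    """
    kept = []
    for fact in sorted(facts, key=lambda f: -f[3]):  # highest confidence first
        text = " ".join(fact[:3]).lower()
        if all(
            SequenceMatcher(None, text, " ".join(k[:3]).lower()).ratio() < threshold
            for k in kept
        ):
            kept.append(fact)
    return kept

facts = [
    ("Acme Corp", "acquired", "Beta Inc", 0.9),
    ("Acme Corporation", "acquired", "Beta Inc", 0.7),  # entity name variant
    ("Acme Corp", "hired", "J. Doe", 0.8),
]
print(len(deduplicate_facts(facts)))  # 2
```

Processing facts in descending confidence order means the surviving copy of each duplicate cluster is always the most trusted one, which matches the "strengthen facts that consistently lead to good answers" goal.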
What We Learned
Graph structure matters more than retrieval tricks
The single biggest contributor to Vrin's accuracy is the knowledge graph itself, not any individual retrieval algorithm. When facts are stored as structured triples with typed relationships and temporal metadata, the retrieval problem becomes tractable. The system doesn't need to guess which chunks are relevant. It follows actual entity relationships.
Iterative reasoning helps on complex questions
Query decomposition improved performance primarily on complex multi-hop questions. For simple factual lookups, it adds latency without much accuracy gain. Vrin's auto-routing detects query complexity and only engages iterative reasoning when the question warrants it.
Consolidation prevents graph degradation
Without periodic consolidation, the knowledge graph accumulates duplicate facts, unresolved contradictions, and orphaned entities. The consolidation pipeline (community detection, 3-stage dedup, contradiction resolution) keeps the graph clean. This is especially important for large document corpora where entity name variations and fact drift are inevitable.
Reproducing These Results
Both benchmarks use publicly available datasets:
- MultiHop-RAG: github.com/yixuantt/MultiHop-RAG
- MuSiQue: huggingface.co/datasets/bdsaglam/musique (answerable, validation split)
Our evaluation methodology:
- Ingest all source documents through Vrin's standard pipeline (fact extraction + vector indexing)
- Query each benchmark question through Vrin's full retrieval and reasoning pipeline
- Extract short factoid answers from Vrin's verbose responses using GPT-4o-mini
- Compare extracted answers against gold-standard labels using exact match and F1 scoring
We use fixed random seeds (42) for all sampling to ensure reproducibility. No benchmark questions were used for training or tuning.
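The exact-match and F1 metrics above are the standard SQuAD-style definitions. A reference implementation (our sketch; the benchmark harnesses may differ in normalization details) looks like this:

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    p_toks, g_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(p_toks) & Counter(g_toks)  # per-token overlap count
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p_toks), overlap / len(g_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))      # 1.0
print(round(f1("tower in Paris", "the eiffel tower"), 2))   # 0.4
```

Exact match is unforgiving of any token difference after normalization, which is why short-answer extraction (step 3 above) matters: a verbose response scored directly would fail EM even when it contains the right answer.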
What This Means for Enterprise AI
Benchmarks are not production metrics. Real-world performance depends on document quality, domain complexity, and query patterns that no benchmark captures.
But benchmarks answer an important question: does the underlying approach work?
The answer, for knowledge-graph-augmented reasoning, is measurably yes. On the hardest multi-hop reasoning tasks in the literature, structured knowledge graphs outperform both vector-only retrieval and direct long-context approaches.
For enterprises evaluating AI knowledge systems, we'd suggest asking your vendors a simple question: what are your benchmark results on standardized tests? If they can't answer, you should ask why.
Vrin is knowledge reasoning infrastructure for enterprise AI. Evaluate it at vrin.cloud.
Founder & CEO
Building knowledge reasoning infrastructure for enterprise AI at VRIN. We believe in transparent research and open benchmarks.