Research

The gap between retrieval and reasoning.

Current RAG systems delegate reasoning to the LLM, failing on multi-hop, temporal, and numerical queries. Vrin engineers five cognitive subprocesses the industry skipped. The results are measurable.

Whitepaper · April 2026

Vrin: From Retrieval to Reasoning. A Hybrid Knowledge Graph Architecture for Enterprise AI

A human analyst answering a complex question doesn't just search for similar text. They identify entities and relationships, formalize the question as a constrained search, know where relevant information lives, connect related facts across documents, and then pull the specific facts needed. That is five cognitive subprocesses, yet current RAG addresses only the last one.

This paper presents Vrin, a hybrid knowledge graph architecture that engineers each of these five subprocesses explicitly. Rather than treating retrieval as a single undifferentiated step, Vrin implements a multi-stage reasoning pipeline: entity-centric fact extraction with coreference resolution and temporal versioning, a knowledge graph with community detection and cross-fact deduplication, graph-aware query planning, confidence-scored multi-hop traversal with Personalized PageRank, iterative reasoning with per-step quality evaluation, and structured context preparation that organizes evidence by concept rather than by source.

The architecture draws from established constructs in cognitive science: the brain's Complementary Learning Systems theory, semantic network theory, and metacognitive monitoring. Vrin independently converged on the same dual-store architecture that HippoRAG applied to RAG at NeurIPS, a convergence suggesting these engineering problems have a natural solution space.

Core thesis

Five subprocesses. RAG addresses one.

The industry spent three years optimizing retrieval: better embeddings, smarter chunking, bigger context windows. The other four subprocesses were left to the LLM.

01 Perceive: Identify entities and relationships (Vrin)
02 Structure: Formalize the query as a constrained search (Vrin)
03 Store: Know where relevant information lives (Vrin)
04 Organize: Connect related facts across documents (Vrin)
05 Retrieve: Pull the specific facts needed (Vrin + RAG)

Standard RAG addresses only subprocess 5 (Retrieve) through semantic similarity. Vrin engineers all five.

Experimental evaluation

Benchmark results.

Evaluated on two complementary academic leaderboards following BetterBench guidelines: fixed-seed sampling, confidence intervals, and open-source evaluation code.
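The two sampling practices named above can be sketched as follows. This is a minimal illustration, not Vrin's evaluation code: the function names, the default seed, and the choice of a percentile bootstrap are all assumptions.

```python
import random

def sample_questions(question_ids, k, seed=0):
    """Draw a reproducible evaluation subset: same seed, same subset."""
    rng = random.Random(seed)
    return rng.sample(list(question_ids), k)

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy.

    `correct` is a list of per-question 0/1 outcomes (hypothetical input format).
    Returns (accuracy, lower bound, upper bound).
    """
    rng = random.Random(seed)
    n = len(correct)
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi
```

Fixing the seed in both functions means a rerun reports the same subset and the same interval, which is what makes a leaderboard number checkable by a third party.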

Leaderboard: MultiHop-RAG
Metric: Semantic Accuracy (SA)

Vrin: 95.1%
ChatGPT 5.2 (Thinking) [Oracle Context]: 78.9%
Multi-Meta RAG (GPT-4): 63.0%
Multi-Meta RAG (Google PaLM): 61.0%
GPT-4 Baseline: 56.0%

Vrin tops the leaderboard even though ChatGPT 5.2 is given the exact evidence documents (oracle context) while Vrin must retrieve from the full corpus. We attribute the gap to structured reasoning.

Leaderboard: MuSiQue
Metric: Exact Match (EM)

Vrin: 47.8%
StepChain GraphRAG: 43.9%
HopRAG: 42.2%
SiReRAG: 40.5%
HippoRAG 2: 37.2%

Compositional multi-hop QA designed to resist shortcuts. Vrin leads the public leaderboard, ahead of StepChain GraphRAG, HopRAG, SiReRAG, and HippoRAG 2.
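For readers unfamiliar with the Exact Match metric, the sketch below shows the normalization commonly used for extractive QA scoring (lowercase, strip punctuation and English articles, collapse whitespace). Whether the MuSiQue leaderboard or Vrin's evaluation code applies exactly these rules is an assumption; the function names are illustrative.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)

def em_score(predictions, references):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

EM gives no partial credit: an answer of "Paris, France" against a gold answer of "Paris" scores zero, which is one reason EM figures run lower than semantic-accuracy figures.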

Where structure matters most

The performance gap between Vrin and ChatGPT 5.2 on MultiHop-RAG is largest on temporal queries (+48.9pp) and comparison queries (+15.5pp), precisely the query types that require understanding the structure of the question rather than finding semantically similar text. On inference queries (single-hop lookups), the two systems perform comparably (99.2% vs 98.4%), which is consistent with the gap being architectural rather than model-dependent.

Architecture

Reasoning before inference.

Before the LLM sees a single token, Vrin has already understood the query, consulted the knowledge graph, traversed multi-hop relationships, evaluated confidence, and organized evidence by concept.

Query complexity routing

Structural classification in under a millisecond determines retrieval depth. Simple factual lookups skip the full pipeline; complex queries get iterative multi-hop reasoning.
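One way a sub-millisecond structural classifier could work is with precompiled regular expressions over surface cues. The sketch below is a guess at the general shape, not Vrin's router: the cue lists, the route names, and the two-connective rule are all hypothetical.

```python
import re

# Hypothetical surface cues; a production router would use a richer set.
TEMPORAL = re.compile(
    r"\b(before|after|since|until|between|during|latest|earlier|later|\d{4})\b", re.I
)
COMPARISON = re.compile(r"\b(compare|versus|vs|than|difference|both|same)\b", re.I)
CONNECTIVE = re.compile(r"\b(and|which|whose|that)\b", re.I)

def route(query):
    """Return (route_name, signals) using only the query's surface structure."""
    signals = {
        "temporal": bool(TEMPORAL.search(query)),
        "comparison": bool(COMPARISON.search(query)),
        "multi_clause": len(CONNECTIVE.findall(query)) >= 2,
    }
    depth = "multi_hop" if any(signals.values()) else "direct_lookup"
    return depth, signals
```

Because the classifier never calls a model, its cost is a handful of regex scans, so simple lookups can skip the full pipeline without paying for the decision.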

Graph-aware query planning

Before decomposing a query, Vrin consults the knowledge graph's structural metadata: what entities exist, which communities they belong to, what relationships connect them.

Multi-strategy graph traversal

Multi-hop beam search with hub-weighted Personalized PageRank, synonym edge resolution, and three parallel traversal strategies merged via reciprocal rank fusion.
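Two of the building blocks named above, Personalized PageRank and reciprocal rank fusion, are standard algorithms and can be sketched briefly. This is plain Personalized PageRank by power iteration; the hub weighting, beam search, and synonym edges that Vrin adds are not reproduced here, and the graph format and parameter defaults are assumptions.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration where restart mass returns to the seed entities.

    graph: {node: [neighbor, ...]}; every neighbor must also be a key.
    seeds: non-empty set of nodes present in the graph.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    nxt[nb] += share
            else:
                # Dangling node: send its mass back to the seeds.
                for n in nodes:
                    nxt[n] += damping * rank[node] * restart[n]
        rank = nxt
    return rank

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each item scores sum(1 / (k + position))."""
    scores = {}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + position)
    # Sort by fused score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

Rank fusion needs only the positions from each strategy, not comparable scores, which is why it suits merging traversal strategies whose raw scores live on different scales.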

5-dimensional confidence scoring

Five scores (entity coverage, type alignment, temporal alignment, fact density, and topical relevance) combine to produce one of three outcomes: proceed, supplement with exploratory retrieval, or bail out.
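A minimal sketch of how five scores might map to three outcomes. The five dimension names and the zero-entity-coverage bail-out come from this page; the equal weighting, the [0, 1] score range, and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass, astuple

@dataclass
class Confidence:
    # Each score is assumed to lie in [0, 1].
    entity_coverage: float
    type_alignment: float
    temporal_alignment: float
    fact_density: float
    topical_relevance: float

def decide(c, proceed_at=0.7, supplement_at=0.4):
    """Map five scores to 'proceed', 'supplement', or 'bail_out'."""
    if c.entity_coverage == 0.0:
        # No query entity was found in the graph: stop early.
        return "bail_out"
    mean = sum(astuple(c)) / 5  # unweighted mean; real weights are unknown
    if mean >= proceed_at:
        return "proceed"
    if mean >= supplement_at:
        return "supplement"
    return "bail_out"
```

Checking entity coverage before anything else keeps the failure path cheap: a query about entities the graph has never seen is rejected without computing or weighing the remaining evidence.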

Iterative reasoning engine

Complex queries decomposed into dependency-ordered sub-questions with targeted retrieval per gap. Each iteration snapshots state and reverts if quality degrades.
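The dependency ordering and snapshot-and-revert behavior described above can be sketched with the standard library. The `answer_step` and `quality` callables are hypothetical stand-ins for targeted retrieval and per-step evaluation, and the state layout is an assumption.

```python
import copy
from graphlib import TopologicalSorter  # Python 3.9+

def run_plan(sub_questions, dependencies, answer_step, quality):
    """Answer sub-questions in dependency order, reverting harmful steps.

    sub_questions: {id: text}
    dependencies:  {id: set of prerequisite ids} (missing ids mean no prerequisites)
    answer_step:   callable(id, text, state) that mutates state in place
    quality:       callable(state) -> float, higher is better
    """
    graph = {qid: dependencies.get(qid, set()) for qid in sub_questions}
    order = list(TopologicalSorter(graph).static_order())
    state = {"answers": {}, "evidence": []}
    for qid in order:
        snapshot = copy.deepcopy(state)  # state before this iteration
        before = quality(state)
        answer_step(qid, sub_questions[qid], state)
        if quality(state) < before:
            state = snapshot  # quality degraded: discard this iteration
    return state, order
```

The deep copy is the simplest correct snapshot; a system with large state would more likely record and undo deltas.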

Structured context preparation

Facts organized by entity and topic, cross-document connections stated as established insights, iterative reasoning chain injected. The LLM synthesizes from organized understanding.
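Organizing evidence by concept rather than by source can be illustrated with a small grouping function. The tuple format, the heading layout, and the function name are assumptions; the sketch covers only the grouping step, not the injected reasoning chain.

```python
from collections import defaultdict

def build_context(facts):
    """Group facts under the entity they describe.

    facts: iterable of (entity, statement, source_doc) tuples (hypothetical format).
    Returns a text block with one heading per entity.
    """
    by_entity = defaultdict(list)
    for entity, statement, source in facts:
        by_entity[entity].append((statement, source))
    lines = []
    for entity in sorted(by_entity):
        # An entity backed by several sources is a cross-document connection.
        sources = sorted({src for _, src in by_entity[entity]})
        lines.append(f"## {entity} (sources: {', '.join(sources)})")
        lines.extend(f"- {statement}" for statement, _ in by_entity[entity])
    return "\n".join(lines)
```

With this layout, two facts about the same entity from different documents sit next to each other in the prompt instead of being separated by whatever else those documents contain.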

Cognitive architecture

Informed by neuroscience, validated by benchmarks.

Vrin's architecture maps to established constructs in cognitive science. The dual-store knowledge graph (Neptune for structured facts, OpenSearch for unstructured passages) mirrors the brain's Complementary Learning Systems: the hippocampus for fast episodic indexing, the neocortex for slow, structured knowledge consolidation.

The multi-hop graph traversal implements spreading activation from semantic network theory: entities activate related entities along typed relationship edges, not through embedding similarity. Hub-weighted PageRank reflects how the brain organizes knowledge through hub-like multi-synaptic structures rather than point-to-point connections.
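Spreading activation over typed edges can be sketched in a few lines. The decay factor, activation threshold, and hop limit are illustrative assumptions, and this simplified version ignores relation types when propagating energy.

```python
def spread_activation(edges, seeds, decay=0.5, threshold=0.1, max_hops=3):
    """Propagate activation from seed entities along relationship edges.

    edges: {entity: [(relation, neighbor), ...]}
    Returns {entity: activation}, keeping the strongest activation per entity.
    """
    activation = {s: 1.0 for s in seeds}
    frontier = dict(activation)
    for _ in range(max_hops):
        nxt = {}
        for node, energy in frontier.items():
            passed = energy * decay
            if passed < threshold:
                continue  # too weak to activate further entities
            for _relation, neighbor in edges.get(node, []):
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    nxt[neighbor] = passed
        if not nxt:
            break
        frontier = nxt
    return activation
```

Activation here depends only on graph distance from the seeds, so an entity two hops away is reached even if its text shares no vocabulary with the query.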

The confidence scoring system draws from metacognitive monitoring. The anterior cingulate cortex detects retrieval uncertainty and can halt processing when evidence is insufficient. Vrin's adaptive bail-out, which detects zero entity coverage and terminates in under 500 ms, is a direct analog.

The nightly consolidation pipeline (community detection, cross-fact deduplication, usage-based stability scoring) mirrors sleep-dependent memory consolidation, where the brain restructures and strengthens frequently accessed knowledge pathways.
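Two of the consolidation steps, cross-fact deduplication and usage-based stability scoring, can be sketched together. The fact format and the stability formula (a fact's share of total usage) are assumptions made for illustration; community detection is omitted.

```python
def consolidate(facts):
    """Merge duplicate facts and score each survivor by share of usage.

    facts: list of dicts with 'subject', 'predicate', 'object', 'uses'
    (hypothetical schema). Input dicts are not modified.
    """
    merged = {}
    for fact in facts:
        # Facts that differ only in case or spacing count as duplicates.
        key = tuple(
            " ".join(fact[field].lower().split())
            for field in ("subject", "predicate", "object")
        )
        if key in merged:
            merged[key]["uses"] += fact["uses"]
        else:
            merged[key] = dict(fact)
    total = sum(f["uses"] for f in merged.values()) or 1
    for f in merged.values():
        f["stability"] = f["uses"] / total
    return list(merged.values())
```

Merging usage counts before scoring means a fact stated in several documents accumulates stability from all of them rather than splitting it across copies.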

Looking forward

The 95% unexplored.

We believe the industry has explored less than 5% of the available innovation space in knowledge-augmented AI. The dominant focus has been on improving the retrieval subprocess: better embeddings, smarter reranking, larger context windows. The other four cognitive subprocesses (perception, structuring, storage, and organization), each with validated science behind them, remain largely unapplied.

Active areas of research include adaptive retrieval that makes finer-grained decisions about which pipeline stages to invoke, automatic domain specialization that detects query patterns and learns domain expertise from usage, and knowledge graph pattern detection that identifies frequently accessed subgraphs and creates memory packs for fine-tuning domain-specialized models.

The fundamental thesis is that AI systems will eventually be specialized like human employees, not through fine-tuning a single model, but through engineering the cognitive infrastructure surrounding it.

Reasoning before inference

The context is reasoned over before the model sees it.

Read the whitepaper, explore the benchmark code, or see it in action.