How We Achieved 97.5% Accuracy on Financial QA - 22% Better Than Oracle Baselines
Founder & CEO
When we first tested VRIN against industry-standard RAG benchmarks, we were skeptical of our own results. 97.5% accuracy on financial question answering seemed too good. So we tested again. And again. The numbers held.
This post is our attempt to be completely transparent about what we found, how we tested, and what it means for teams building AI applications that need to actually work.
| Benchmark | Metric | VRIN | Oracle Baseline | Relative Improvement |
|---|---|---|---|---|
| RAGBench FinQA | Number Match | 97.5% ±3.2% | 79.4% (LLaMA 3.3-70B Oracle) | +22.8% |
| MultiHop-RAG | Semantic Accuracy | 82.6% ±3.2% | 63.0% (Multi-Meta RAG + GPT-4) | +31% |
Results at 95% confidence (±3.2% margin) using 28% of each test dataset with Oracle context methodology. Full methodology, raw data, and reproduction scripts available on GitHub.
Important Note on Methodology: These benchmarks use "Oracle + Noise" context—each question receives a curated set of 2-5 documents that includes the relevant information plus some distractors. This measures reasoning quality (can the system extract and compute the correct answer?) rather than retrieval capability (can the system find the right documents from thousands?). We compare against other systems using the same Oracle context methodology for a fair comparison.
These aren't cherry-picked results. They're from statistically rigorous tests against public benchmarks with published Oracle baselines. Let us show you exactly how we got here.
We chose these two benchmarks specifically because they test the hardest problems in enterprise RAG:
Financial documents are the ultimate stress test for RAG systems. They combine dense tables, precise numerical values, and temporal references that must be disambiguated across reporting periods.
The benchmark contains 32,908 question-answer pairs from 9,095 real-world financial reports. When a financial analyst asks "What was the percentage change in goodwill from 2016 to 2017?", the system needs to find the right table, extract the correct cells, and compute the answer.
Most systems struggle here even with Oracle context. The T²-RAGBench leaderboard shows LLaMA 3.3-70B achieves 79.4% with Oracle context (where relevant documents are pre-provided). Retrieval-based systems drop to ~47% when they must find documents from a larger corpus.
Real questions rarely have answers in a single document. MultiHop-RAG tests this with 2,556 queries that require synthesizing information across 2-4 documents.
Example query: "Which company, discussed by both TechCrunch and The Verge for its antitrust issues, paid billions to be the default search engine?"
Answering this requires identifying the company covered by both outlets, connecting its antitrust coverage across the two sources, and recalling which company paid billions to be the default search engine.
The best published result is 63.0% using Multi-Meta RAG with GPT-4.
We followed rigorous statistical protocols to ensure our results are meaningful and reproducible.
- **Sample Coverage:** 28% of each test dataset (~670 questions)
- **Confidence:** 95% with ±3.2% margin of error
- **Sampling:** Random selection, reproducible seed (42)
- **Context Type:** Oracle + Noise (2-5 documents per question)
This sample size follows BetterBench statistical guidelines. Results plateaued across multiple test runs, indicating stable performance.
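As a rough illustration of the sampling setup described above, here is a minimal sketch of reproducible subsampling and a worst-case margin-of-error calculation. The dataset contents and the `sample_questions` helper are hypothetical; the actual benchmark scripts in the repo are the authoritative implementation.

```python
import math
import random

def conservative_margin(n, z=1.96):
    """Worst-case (p = 0.5) margin of error for a proportion at ~95% confidence."""
    return z * math.sqrt(0.25 / n)

def sample_questions(dataset, fraction=0.28, seed=42):
    """Reproducibly sample a fraction of the benchmark questions."""
    rng = random.Random(seed)
    k = round(len(dataset) * fraction)
    return rng.sample(dataset, k)

# Placeholder question IDs sized like MultiHop-RAG (2,556 queries)
questions = [f"q{i}" for i in range(2556)]
subset = sample_questions(questions)
print(len(subset), round(conservative_margin(len(subset)), 3))
```

Because the RNG is seeded, repeated runs select the same subset, which is what makes the reported numbers reproducible.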
For each benchmark question, VRIN receives its designated document set, generates an answer, and that answer is scored against the gold response.
Note: This methodology matches how the T²-RAGBench leaderboard evaluates Oracle context performance. Each question receives its designated document set (2-5 documents, some relevant, some distractors). This tests reasoning capability, not retrieval from a large corpus.
FinQA (Number Match): Does VRIN's response contain the correct numerical values? This is a strict metric—partial credit isn't given.
MultiHop-RAG (Semantic Accuracy): Is the answer semantically correct? This accounts for VRIN providing verbose, contextual answers rather than one-word responses.
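To make the Number Match metric concrete, here is a simplified sketch of what a strict numeric check could look like: every number in the expected answer must appear in the response, with no partial credit. This is an illustration under our own assumptions, not the exact scoring code from the repo.

```python
import math
import re

NUM_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_numbers(text):
    """Pull all numeric values out of a text span, ignoring thousands separators."""
    return [float(m.group().replace(",", "")) for m in NUM_RE.finditer(text)]

def number_match(response, expected, rel_tol=1e-4):
    """Strict metric: every expected number must appear somewhere in the response."""
    resp_nums = extract_numbers(response)
    return all(
        any(math.isclose(e, r, rel_tol=rel_tol) for r in resp_nums)
        for e in extract_numbers(expected)
    )

print(number_match("Goodwill rose 12.5% to $4,200 million", "12.5"))  # True
print(number_match("Goodwill rose 12.4%", "12.5"))                    # False
```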
The performance gap comes from architectural decisions, not just better prompts.
Traditional RAG chunks documents and embeds them. VRIN does something different.
When VRIN ingests a financial report, it extracts structured facts:
```
Acme Corp → annual_revenue_2024 → $4.2 billion
Acme Corp → ceo → Jane Rivera
Jane Rivera → role → CEO of Acme Corp
```
These facts form a knowledge graph that preserves relationships. When you ask about Acme Corp's revenue, VRIN doesn't search through text chunks—it traverses relationships.
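In spirit, the structured facts above behave like a triple store that you query by relationship rather than by text similarity. The following is a minimal sketch of that idea (the `FactGraph` class and its API are illustrative, not VRIN's internal data structure):

```python
from collections import defaultdict

class FactGraph:
    """Minimal triple store: (subject, predicate, object) facts with lookup."""
    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, subj, pred, obj):
        self.edges[subj].append((pred, obj))

    def lookup(self, subj, pred):
        """Return every object linked to `subj` via `pred`."""
        return [o for p, o in self.edges[subj] if p == pred]

g = FactGraph()
g.add("Acme Corp", "annual_revenue_2024", "$4.2 billion")
g.add("Acme Corp", "ceo", "Jane Rivera")
g.add("Jane Rivera", "role", "CEO of Acme Corp")

print(g.lookup("Acme Corp", "annual_revenue_2024"))  # ['$4.2 billion']
```

A revenue question resolves to an exact edge lookup instead of a fuzzy match over text chunks, which is where the numerical precision comes from.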
For any query, VRIN runs two retrieval paths in parallel: knowledge-graph traversal and vector similarity search.
The results are fused, giving you both precision (from the graph) and recall (from vectors).
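The post doesn't specify the fusion method, but reciprocal rank fusion is a common way to merge two ranked result lists, and it sketches the idea: each path votes by rank, and items that both paths surface float to the top. The document IDs below are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked lists; items ranked well by several lists score highest."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

graph_hits  = ["fact:acme_revenue", "doc:10k_2024", "doc:press_release"]
vector_hits = ["doc:10k_2024", "doc:earnings_call", "fact:acme_revenue"]
print(reciprocal_rank_fusion([graph_hits, vector_hits]))
```

Here `doc:10k_2024` wins because both paths rank it highly, even though neither puts it first on its own.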
Financial documents are full of tables. VRIN's extraction pipeline detects tables, parses their row and column structure, and converts each cell into a structured fact tied to its row and column labels.
This means "What was goodwill in 2017?" actually finds the table cell, not just text that mentions goodwill.
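As a simplified sketch of turning a markdown table into cell-level facts (the `table_to_facts` helper and its fact-naming scheme are assumptions for illustration, not the production pipeline):

```python
def table_to_facts(markdown_table, entity):
    """Convert a markdown table into (entity, rowlabel_column, value) facts."""
    rows = [[c.strip() for c in line.strip("|").split("|")]
            for line in markdown_table.strip().splitlines()]
    header = rows[0]
    # Drop the |---|---| separator row, keep real data rows
    body = [r for r in rows[1:] if not set("".join(r)) <= set("-: ")]
    facts = []
    for row in body:
        label = row[0]
        for col, value in zip(header[1:], row[1:]):
            facts.append((entity, f"{label.lower()}_{col}", value))
    return facts

table = """
| Item | 2016 | 2017 |
|---|---|---|
| Goodwill | $3,100 | $3,450 |
"""
print(table_to_facts(table, "Acme Corp"))
```

With cells stored as facts, "What was goodwill in 2017?" becomes a direct lookup of `goodwill_2017` rather than a text search.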
Here's a subtle but powerful capability: VRIN doesn't just match entities in your query—it discovers entities in retrieved documents and performs secondary graph traversals.
If you ask about "the choppy website issue" and the retrieved documents mention "Sarah Chen reported the problem," VRIN will discover Sarah Chen and find additional facts about her involvement.
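The secondary-traversal idea can be sketched as a two-step lookup: scan retrieved text for known entities that weren't in the original query, then pull their facts from the graph. The function and the toy graph below are illustrative assumptions, not VRIN internals.

```python
def secondary_traversal(graph, retrieved_text, known_entities):
    """Discover entities mentioned in retrieved text and fetch their facts."""
    discovered = [e for e in graph if e in retrieved_text and e not in known_entities]
    extra_facts = {e: graph[e] for e in discovered}
    return discovered, extra_facts

graph = {
    "Sarah Chen": [("reported", "choppy website issue"), ("team", "frontend")],
    "Acme Corp": [("ceo", "Jane Rivera")],
}
text = "Sarah Chen reported the problem on Tuesday."
found, extra = secondary_traversal(graph, text, known_entities={"Acme Corp"})
print(found)  # ['Sarah Chen']
```

The query never mentioned Sarah Chen, but the traversal surfaces her related facts, which is what lets follow-up reasoning use her involvement.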
We believe transparency builds trust. Here's where VRIN has room to improve:
Our markdown table detection handles standard formats well, but complex nested tables or unusual layouts can trip it up. We're actively improving this with a dedicated development track.
Queries with 3+ simultaneous constraints sometimes miss edge cases. "Find all Q4 2023 revenue figures for tech companies mentioned in both WSJ and Bloomberg" can get complex.
Tables with 50+ rows can exceed optimal chunk sizes. We're working on hierarchical table processing.
Our tests sample 28% of each benchmark dataset (~670 questions), providing ±3.2% margin of error at 95% confidence following BetterBench guidelines. Results plateaued across multiple runs with reproducible seed (42).
These benchmark results translate to real capabilities for reasoning over provided documents:
97.5% accuracy on FinQA (Oracle context) means VRIN can reason accurately over financial reports when you provide the relevant documents. Due diligence, earnings analysis, regulatory compliance—tasks where numerical precision matters.
82.6% on MultiHop-RAG means your AI can synthesize information across related documents the way a research analyst would. Legal discovery, competitive intelligence, technical documentation—anywhere answers span multiple sources you've ingested.
These results use "Oracle context"—the relevant documents are provided for each question. In production, you'd ingest your document corpus and VRIN retrieves relevant content. The benchmark measures reasoning capability; production performance also depends on retrieval quality from your specific corpus.
Our benchmark scripts, raw results, and evaluation logs are fully open source:
```shell
# Clone the benchmark repository
git clone https://github.com/Vrin-cloud/vrin-benchmarks
cd vrin-benchmarks

# Install dependencies
pip install -r requirements.txt

# Run FinQA benchmark (requires VRIN API key)
python run_finqa_benchmark.py --sample 100

# Run MultiHop-RAG benchmark
python run_multihop_benchmark.py --sample 100

# View our actual results
ls results/   # Contains raw JSON logs from our runs
```
What's in the repo:
- `run_finqa_benchmark.py` - FinQA evaluation script
- `run_multihop_benchmark.py` - MultiHop-RAG evaluation script
- `README.md` - Detailed methodology documentation

The evaluation logic is straightforward: no hidden post-processing or result filtering. Every answer is logged with the expected response for full transparency.
Benchmarks are useful, but your documents are what matter. We encourage you to run the benchmark scripts yourself and then test VRIN on your own data.
If VRIN doesn't work for your use case, we'd rather know that than have you waste time.
We're continuing to push on several fronts, including complex table handling, multi-constraint query resolution, and hierarchical processing for large tables.
VRIN's performance on RAGBench FinQA (97.5%) and MultiHop-RAG (82.6%) demonstrates that the hybrid approach—knowledge graphs plus vector search—delivers superior reasoning capability on Oracle context benchmarks.
This isn't about having a better language model. It's about smarter extraction, structured knowledge, and systems that reason across documents the way humans do. Our entity-centric extraction and temporal disambiguation enable accurate numerical reasoning that even 70B parameter models struggle with.
What these results mean: When given the right documents, VRIN extracts and reasons over financial data more accurately than leading LLMs. The next frontier is combining this reasoning capability with robust retrieval for real-world deployments.
The code is open. The methodology is documented. The results are reproducible. We're betting that transparency about what works (and what doesn't yet) is more valuable than marketing claims.
Questions about methodology or want to discuss results? Open an issue on GitHub or contact us at vrin.cloud
Building the next generation of enterprise AI memory at VRIN. We believe in transparent research and open benchmarks.