Tags: benchmarks, RAG, performance, FinQA, MultiHop-RAG, hybrid-rag

How We Achieved 97.5% Accuracy on Financial QA - 22% Better Than Oracle Baselines

Vedant Patel

Founder & CEO

January 5, 2026
12 min read

When we first tested VRIN against industry-standard RAG benchmarks, we were skeptical of our own results. 97.5% accuracy on financial question answering seemed too good. So we tested again. And again. The numbers held.

This post is our attempt to be completely transparent about what we found, how we tested, and what it means for teams building AI applications that need to actually work.


The Results at a Glance

| Benchmark | Metric | VRIN | Oracle Baseline | Improvement |
|---|---|---|---|---|
| T²-RAGBench FinQA | Number Match | 97.5% ±3.2% | 79.4% (LLaMA 3.3-70B Oracle) | +22.8% |
| MultiHop-RAG | Semantic Accuracy | 82.6% ±3.2% | 63.0% (Multi-Meta RAG + GPT-4) | +31% |

Results at 95% confidence (±3.2% margin) using 28% of each test dataset with Oracle context methodology. Full methodology, raw data, and reproduction scripts available on GitHub.

Important Note on Methodology: These benchmarks use "Oracle + Noise" context—each question receives a curated set of 2-5 documents that includes the relevant information plus some distractors. This measures reasoning quality (can the system extract and compute the correct answer?) rather than retrieval capability (can the system find the right documents from thousands?). We compare against other systems using the same Oracle context methodology for a fair comparison.

These aren't cherry-picked results. They're from statistically rigorous tests against public benchmarks with published Oracle baselines. Let us show you exactly how we got here.


Why These Benchmarks Matter

We chose these two benchmarks specifically because they test the hardest problems in enterprise RAG:

T²-RAGBench FinQA: The Table + Text Challenge

Financial documents are the ultimate stress test for RAG systems. They combine:

  • Dense tables with numerical data that must be precisely retrieved
  • Narrative text that provides context for those numbers
  • Multi-step reasoning to compute ratios, percentages, and comparisons

The benchmark contains 32,908 question-answer pairs from 9,095 real-world financial reports. When a financial analyst asks "What was the percentage change in goodwill from 2016 to 2017?", the system needs to find the right table, extract the correct cells, and compute the answer.
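That multi-step chain (find the table, extract the cells, compute the derived metric) is easy to state but hard for chunked retrieval to get right. A minimal sketch of the final computation step, using invented goodwill figures for illustration:

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new, in percent."""
    return (new - old) / old * 100

# Toy goodwill table row (values in $ millions) -- illustrative numbers only.
goodwill = {"2016": 1200.0, "2017": 1350.0}

answer = pct_change(goodwill["2016"], goodwill["2017"])
print(f"{answer:.1f}%")  # 12.5%
```

The arithmetic is trivial; the benchmark's difficulty is in reliably producing the two correct operands from a dense table before this step runs.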

Most systems struggle here even with Oracle context. The T²-RAGBench leaderboard shows LLaMA 3.3-70B achieves 79.4% with Oracle context (where relevant documents are pre-provided). Retrieval-based systems drop to ~47% when they must find documents from a larger corpus.

MultiHop-RAG: The Cross-Document Reasoning Challenge

Real questions rarely have answers in a single document. MultiHop-RAG tests this with 2,556 queries that require synthesizing information across 2-4 documents.

Example query: "Which company, discussed by both TechCrunch and The Verge for its antitrust issues, paid billions to be the default search engine?"

Answering this requires:

  1. Finding TechCrunch articles about antitrust
  2. Finding Verge articles about antitrust
  3. Identifying the common entity (Google)
  4. Confirming the search engine default payment detail

The best published result is 63.0% using Multi-Meta RAG with GPT-4.
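The four steps above reduce to set operations over entities extracted per source. A toy sketch (article coverage and entity sets invented for illustration):

```python
# Step 1-2: entities mentioned in each outlet's antitrust coverage (invented data).
techcrunch_antitrust = {"Google", "Amazon"}
verge_antitrust = {"Google", "Apple"}
# Entities tied to default-search-engine payments (invented data).
default_search_payers = {"Google"}

# Step 3: identify the common entity across both outlets.
candidates = techcrunch_antitrust & verge_antitrust
# Step 4: confirm the payment detail.
answer = candidates & default_search_payers
print(answer)  # {'Google'}
```

Real multi-hop systems must do this with noisy entity extraction over full articles, which is where most of the 37-point headroom in the published baseline goes.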


Our Testing Methodology

We followed rigorous statistical protocols to ensure our results are meaningful and reproducible.

Sample Design

  • Sample Coverage: 28% of each test dataset (~670 questions)
  • Confidence: 95% with ±3.2% margin of error
  • Sampling: Random selection, reproducible seed (42)
  • Context Type: Oracle + Noise (2-5 documents per question)

This sample size follows BetterBench statistical guidelines. Results plateaued across multiple test runs, indicating stable performance.
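For readers who want to sanity-check sample sizes like this, the standard normal-approximation margin of error for a proportion is a one-liner. The exact margin depends on the observed accuracy (margins shrink as the proportion moves away from 0.5), so this generic sketch is not a reproduction of our specific ±3.2% figure:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% normal-approximation margin of error for a proportion p over n samples."""
    return z * math.sqrt(p * (1 - p) / n)

# Worst-case (p = 0.5) margin for a ~670-question sample; observed-accuracy
# margins are tighter than this bound.
print(f"±{margin_of_error(0.5, 670) * 100:.1f} pts")
```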

Test Protocol (Oracle Context)

For each benchmark question:

  1. Ingest: Insert the provided documents for that question into VRIN (Oracle context)
  2. Extract: Let VRIN's entity-centric pipeline process the content
  3. Query: Submit the benchmark question
  4. Evaluate: Compare VRIN's response against the expected answer

Note: This methodology matches how the T²-RAGBench leaderboard evaluates Oracle context performance. Each question receives its designated document set (2-5 documents, some relevant, some distractors). This tests reasoning capability, not retrieval from a large corpus.

Evaluation Criteria

FinQA (Number Match): Does VRIN's response contain the correct numerical values? This is a strict metric—partial credit isn't given.

MultiHop-RAG (Semantic Accuracy): Is the answer semantically correct? This accounts for VRIN providing verbose, contextual answers rather than one-word responses.
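To make the strictness of Number Match concrete, here is a simplified sketch of the idea (not our exact evaluation code, which lives in the benchmark repo): pull every number out of the response and check whether the expected value appears.

```python
import re

def number_match(response: str, expected: float, tol: float = 1e-2) -> bool:
    """Strict number-match: does the response contain the expected value?"""
    # Match integers and decimals, allowing thousands separators like 1,350.
    found = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", response)
    return any(abs(float(x.replace(",", "")) - expected) <= tol for x in found)

print(number_match("Goodwill rose by 12.5% to $1,350 million.", 12.5))  # True
print(number_match("Goodwill rose modestly year over year.", 12.5))     # False
```

A verbose but numerically correct answer passes; an answer that hedges without the number fails. There is no partial credit.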


What's Actually Happening Under the Hood

The performance gap comes from architectural decisions, not just better prompts.

1. Entity-Centric Fact Extraction

Traditional RAG chunks documents and embeds them. VRIN does something different.

When VRIN ingests a financial report, it extracts structured facts:

Acme Corp → annual_revenue_2024 → $4.2 billion
Acme Corp → ceo → Jane Rivera
Jane Rivera → role → CEO of Acme Corp

These facts form a knowledge graph that preserves relationships. When you ask about Acme Corp's revenue, VRIN doesn't search through text chunks—it traverses relationships.
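The simplest mental model for this is a triple store: facts as subject-predicate-object tuples, answered by lookup rather than text search. This sketch uses the example facts above; the lookup API is illustrative, not VRIN's actual internals:

```python
# Facts from the example above, stored as (subject, predicate, object) triples.
facts = [
    ("Acme Corp", "annual_revenue_2024", "$4.2 billion"),
    ("Acme Corp", "ceo", "Jane Rivera"),
    ("Jane Rivera", "role", "CEO of Acme Corp"),
]

def lookup(subject: str, predicate: str) -> list[str]:
    """Answer a query by traversing stored facts rather than searching text."""
    return [o for s, p, o in facts if s == subject and p == predicate]

print(lookup("Acme Corp", "annual_revenue_2024"))  # ['$4.2 billion']
```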

2. Hybrid Retrieval (Graph + Vector)

For any query, VRIN runs two retrieval paths in parallel:

  • Graph traversal: Finds facts directly connected to query entities
  • Vector search: Finds semantically similar content

The results are fused, giving you both precision (from the graph) and recall (from vectors).
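One widely used fusion rule for combining two ranked lists is reciprocal rank fusion (RRF). We are not claiming this is VRIN's exact fusion algorithm; it is a generic sketch of how graph hits and vector hits can be merged into a single ranking:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists by summing 1/(k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result IDs: a doc found by both paths rises to the top.
graph_hits = ["fact_goodwill_2017", "fact_goodwill_2016"]
vector_hits = ["chunk_goodwill_note", "fact_goodwill_2017"]
print(rrf([graph_hits, vector_hits]))
```

The useful property: a result that appears in both paths outranks a result that tops only one, which is exactly the precision-plus-recall behavior described above.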

3. Table-Aware Processing

Financial documents are full of tables. VRIN's extraction pipeline:

  • Detects table structures in documents
  • Preserves row/column relationships
  • Extracts cell values as discrete facts
  • Links table data to document context

This means "What was goodwill in 2017?" actually finds the table cell, not just text that mentions goodwill.
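A stripped-down sketch of the cell-to-fact idea: parse a standard markdown table and emit one (row label, column header, value) fact per cell. Real financial tables are far messier; this only illustrates why preserved row/column structure makes "goodwill in 2017" a direct lookup.

```python
table = """| Item | 2016 | 2017 |
|---|---|---|
| Goodwill | 1,200 | 1,350 |"""

# Split each line into cells; rows[1] is the |---| separator line.
rows = [[c.strip() for c in line.strip("|").split("|")] for line in table.splitlines()]
header, body = rows[0], rows[2:]

# One discrete fact per cell: (row label, column header, value).
facts = [(row[0], year, value) for row in body for year, value in zip(header[1:], row[1:])]
print(facts)  # [('Goodwill', '2016', '1,200'), ('Goodwill', '2017', '1,350')]
```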

4. Entity Discovery from Documents

Here's a subtle but powerful capability: VRIN doesn't just match entities in your query—it discovers entities in retrieved documents and performs secondary graph traversals.

If you ask about "the choppy website issue" and the retrieved documents mention "Sarah Chen reported the problem," VRIN will discover Sarah Chen and find additional facts about her involvement.
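The two-hop pattern can be sketched as: run entity recognition over retrieved text, then do a second pass over the graph for any newly discovered entities. The facts and the string-matching "NER" below are stand-ins for illustration:

```python
# Illustrative fact store; real entity recognition is far richer than substring matching.
facts = {
    ("Sarah Chen", "reported", "choppy website issue"),
    ("Sarah Chen", "team", "Frontend Platform"),
}

def discover_entities(text: str) -> set[str]:
    """Stand-in for entity recognition over retrieved document text."""
    return {s for s, _, _ in facts if s in text}

retrieved = "Sarah Chen reported the problem during the incident review."
# Second hop: pull every fact about entities discovered in the retrieved text.
extra_facts = [f for f in facts if f[0] in discover_entities(retrieved)]
```

The query never mentioned Sarah Chen, yet the second hop surfaces her team and her role in the incident.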


Honest Assessment: What We Don't Do Well

We believe transparency builds trust. Here's where VRIN has room to improve:

Table Extraction Gaps

Our markdown table detection handles standard formats well, but complex nested tables or unusual layouts can trip it up. We're actively improving this with a dedicated development track.

Multi-Constraint Intersection

Queries with 3+ simultaneous constraints sometimes miss edge cases. "Find all Q4 2023 revenue figures for tech companies mentioned in both WSJ and Bloomberg" can get complex.

Very Large Tables

Tables with 50+ rows can exceed optimal chunk sizes. We're working on hierarchical table processing.

Statistical Rigor

Our tests sample 28% of each benchmark dataset (~670 questions) rather than the full set, yielding a ±3.2% margin of error at 95% confidence per BetterBench guidelines. Results plateaued across multiple runs with a reproducible seed (42), but full-dataset validation is still ahead of us.


What This Means for Your Team

These benchmark results translate to real capabilities for reasoning over provided documents:

If You're Building Financial Applications

97.5% accuracy on FinQA (Oracle context) means VRIN can reason accurately over financial reports when you provide the relevant documents. Due diligence, earnings analysis, regulatory compliance—tasks where numerical precision matters.

If You Have Knowledge Across Many Documents

82.6% on MultiHop-RAG means your AI can synthesize information across related documents the way a research analyst would. Legal discovery, competitive intelligence, technical documentation—anywhere answers span multiple sources you've ingested.

Understanding Oracle Context

These results use "Oracle context"—the relevant documents are provided for each question. In production, you'd ingest your document corpus and VRIN retrieves relevant content. The benchmark measures reasoning capability; production performance also depends on retrieval quality from your specific corpus.


Reproduce Our Results

Our benchmark scripts, raw results, and evaluation logs are fully open source:

```bash
# Clone the benchmark repository
git clone https://github.com/Vrin-cloud/vrin-benchmarks
cd vrin-benchmarks

# Install dependencies
pip install -r requirements.txt

# Run FinQA benchmark (requires VRIN API key)
python run_finqa_benchmark.py --sample 100

# Run MultiHop-RAG benchmark
python run_multihop_benchmark.py --sample 100

# View our actual results
ls results/  # Contains raw JSON logs from our runs
```

What's in the repo:

  • run_finqa_benchmark.py - FinQA evaluation script
  • run_multihop_benchmark.py - MultiHop-RAG evaluation script
  • README.md - Detailed methodology documentation

The evaluation logic is straightforward—no hidden post-processing or result filtering. Every answer is logged with the expected response for full transparency.


Try It With Your Own Documents

Benchmarks are useful, but your documents are what matter. We encourage you to:

  1. Sign up at vrin.cloud
  2. Ingest a few of your challenging documents
  3. Ask the questions that current solutions can't answer well
  4. Compare the results

If VRIN doesn't work for your use case, we'd rather know that than have you waste time.


What's Next

We're continuing to push on several fronts:

  • Full dataset validation on FinQA and MultiHop-RAG
  • Table extraction improvements for complex document layouts
  • Additional benchmarks including RAGTruth (hallucination detection)
  • Enterprise-specific benchmarks for legal, healthcare, and compliance domains

The Bottom Line

VRIN's performance on T²-RAGBench FinQA (97.5%) and MultiHop-RAG (82.6%) demonstrates that the hybrid approach (knowledge graphs plus vector search) delivers superior reasoning capability on Oracle context benchmarks:

  • +22.8% better than LLaMA 3.3-70B Oracle baseline on FinQA
  • +31% better than Multi-Meta RAG + GPT-4 on MultiHop-RAG

This isn't about having a better language model. It's about smarter extraction, structured knowledge, and systems that reason across documents the way humans do. Our entity-centric extraction and temporal disambiguation enable accurate numerical reasoning that even 70B parameter models struggle with.

What these results mean: When given the right documents, VRIN extracts and reasons over financial data more accurately than leading LLMs. The next frontier is combining this reasoning capability with robust retrieval for real-world deployments.

The code is open. The methodology is documented. The results are reproducible. We're betting that transparency about what works (and what doesn't yet) is more valuable than marketing claims.


Questions about methodology or want to discuss results? Open an issue on GitHub or contact us at vrin.cloud

Vedant Patel

Founder & CEO

Building the next generation of enterprise AI memory at VRIN. We believe in transparent research and open benchmarks.
