How We Achieved 97.5% Accuracy on Financial QA - 22% Better Than Oracle Baselines
Founder & CEO
When we first tested VRIN against industry-standard RAG benchmarks, we were skeptical of our own results. 97.5% accuracy on financial question answering seemed too good. So we tested again. And again. The numbers held.
This post is our attempt to be completely transparent about what we found, how we tested, and what it means for teams building AI applications that need to actually work.
| Benchmark | Metric | VRIN | Oracle Baseline | Relative Improvement |
|---|---|---|---|---|
| RAGBench FinQA | Number Match | 97.5% ±3.2% | 79.4% (LLaMA 3.3-70B Oracle) | +22.8% |
| MultiHop-RAG | Semantic Accuracy | 82.6% ±3.2% | 63.0% (Multi-Meta RAG + GPT-4) | +31% |
Results at 95% confidence (±3.2% margin) using 28% of each test dataset with Oracle context methodology. Full methodology, raw data, and reproduction scripts available on GitHub.
Important Note on Methodology: These benchmarks use "Oracle + Noise" context—each question receives a curated set of 2-5 documents that includes the relevant information plus some distractors. This measures reasoning quality (can the system extract and compute the correct answer?) rather than retrieval capability (can the system find the right documents from thousands?). We compare against other systems using the same Oracle context methodology for a fair comparison.
These aren't cherry-picked results. They're from statistically rigorous tests against public benchmarks with published Oracle baselines. Let us show you exactly how we got here.
We chose these two benchmarks specifically because they test the hardest problems in enterprise RAG:
Financial documents are the ultimate stress test for RAG systems. They combine dense tables, precise numerical values, and temporal references that must be disambiguated across reporting periods.
The benchmark contains 32,908 question-answer pairs from 9,095 real-world financial reports. When a financial analyst asks "What was the percentage change in goodwill from 2016 to 2017?", the system needs to find the right table, extract the correct cells, and compute the answer.
Most systems struggle here even with Oracle context. The T²-RAGBench leaderboard shows LLaMA 3.3-70B achieves 79.4% with Oracle context (where relevant documents are pre-provided). Retrieval-based systems drop to ~47% when they must find documents from a larger corpus.
Real questions rarely have answers in a single document. MultiHop-RAG tests this with 2,556 queries that require synthesizing information across 2-4 documents.
Example query: "Which company, discussed by both TechCrunch and The Verge for its antitrust issues, paid billions to be the default search engine?"
Answering this requires identifying the company covered by both outlets, connecting its antitrust coverage across the two sources, and recalling which company paid billions to be the default search engine.
The best published result is 63.0% using Multi-Meta RAG with GPT-4.
We followed rigorous statistical protocols to ensure our results are meaningful and reproducible.
- **Sample Coverage:** 28% of each test dataset (~670 questions)
- **Confidence:** 95% with ±3.2% margin of error
- **Sampling:** Random selection, reproducible seed (42)
- **Context Type:** Oracle + Noise (2-5 documents per question)
This sample size follows BetterBench statistical guidelines. Results plateaued across multiple test runs, indicating stable performance.
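As a rough illustration of the sampling setup described above, here is a minimal sketch of reproducible subsampling and a worst-case margin-of-error calculation. The dataset contents and the `sample_questions` helper are hypothetical; the actual benchmark scripts in the repo are the authoritative implementation.

```python
import math
import random

def conservative_margin(n, z=1.96):
    """Worst-case (p = 0.5) margin of error for a proportion at ~95% confidence."""
    return z * math.sqrt(0.25 / n)

def sample_questions(dataset, fraction=0.28, seed=42):
    """Reproducibly sample a fraction of the benchmark questions."""
    rng = random.Random(seed)
    k = round(len(dataset) * fraction)
    return rng.sample(dataset, k)

# Placeholder question IDs sized like MultiHop-RAG (2,556 queries)
questions = [f"q{i}" for i in range(2556)]
subset = sample_questions(questions)
print(len(subset), round(conservative_margin(len(subset)), 3))
```

Because the RNG is seeded, repeated runs select the same subset, which is what makes the reported numbers reproducible.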
For each benchmark question, VRIN receives its designated document set, generates an answer, and that answer is scored against the gold response.
Note: This methodology matches how the T²-RAGBench leaderboard evaluates Oracle context performance. Each question receives its designated document set (2-5 documents, some relevant, some distractors). This tests reasoning capability, not retrieval from a large corpus.
FinQA (Number Match): Does VRIN's response contain the correct numerical values? This is a strict metric—partial credit isn't given.
MultiHop-RAG (Semantic Accuracy): Is the answer semantically correct? This accounts for VRIN providing verbose, contextual answers rather than one-word responses.
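To make the Number Match metric concrete, here is a simplified sketch of what a strict numeric check could look like: every number in the expected answer must appear in the response, with no partial credit. This is an illustration under our own assumptions, not the exact scoring code from the repo.

```python
import math
import re

NUM_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_numbers(text):
    """Pull all numeric values out of a text span, ignoring thousands separators."""
    return [float(m.group().replace(",", "")) for m in NUM_RE.finditer(text)]

def number_match(response, expected, rel_tol=1e-4):
    """Strict metric: every expected number must appear somewhere in the response."""
    resp_nums = extract_numbers(response)
    return all(
        any(math.isclose(e, r, rel_tol=rel_tol) for r in resp_nums)
        for e in extract_numbers(expected)
    )

print(number_match("Goodwill rose 12.5% to $4,200 million", "12.5"))  # True
print(number_match("Goodwill rose 12.4%", "12.5"))                    # False
```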
The performance gap comes from architectural decisions, not just better prompts.
Traditional RAG chunks documents and embeds them. VRIN does something different.
When VRIN ingests a financial report, it extracts structured facts:
```
Acme Corp → annual_revenue_2024 → $4.2 billion
Acme Corp → ceo → Jane Rivera
Jane Rivera → role → CEO of Acme Corp
```
These facts form a knowledge graph that preserves relationships. When you ask about Acme Corp's revenue, VRIN doesn't search through text chunks—it traverses relationships.
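In spirit, the structured facts above behave like a triple store that you query by relationship rather than by text similarity. The following is a minimal sketch of that idea (the `FactGraph` class and its API are illustrative, not VRIN's internal data structure):

```python
from collections import defaultdict

class FactGraph:
    """Minimal triple store: (subject, predicate, object) facts with lookup."""
    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, subj, pred, obj):
        self.edges[subj].append((pred, obj))

    def lookup(self, subj, pred):
        """Return every object linked to `subj` via `pred`."""
        return [o for p, o in self.edges[subj] if p == pred]

g = FactGraph()
g.add("Acme Corp", "annual_revenue_2024", "$4.2 billion")
g.add("Acme Corp", "ceo", "Jane Rivera")
g.add("Jane Rivera", "role", "CEO of Acme Corp")

print(g.lookup("Acme Corp", "annual_revenue_2024"))  # ['$4.2 billion']
```

A revenue question resolves to an exact edge lookup instead of a fuzzy match over text chunks, which is where the numerical precision comes from.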
For any query, VRIN runs two retrieval paths in parallel: knowledge-graph traversal and vector similarity search.
The results are fused, giving you both precision (from the graph) and recall (from vectors).
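The post doesn't specify the fusion method, but reciprocal rank fusion is a common way to merge two ranked result lists, and it sketches the idea: each path votes by rank, and items that both paths surface float to the top. The document IDs below are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse multiple ranked lists; items ranked well by several lists score highest."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

graph_hits  = ["fact:acme_revenue", "doc:10k_2024", "doc:press_release"]
vector_hits = ["doc:10k_2024", "doc:earnings_call", "fact:acme_revenue"]
print(reciprocal_rank_fusion([graph_hits, vector_hits]))
```

Here `doc:10k_2024` wins because both paths rank it highly, even though neither puts it first on its own.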
Financial documents are full of tables. VRIN's extraction pipeline detects tables, parses their row and column structure, and converts each cell into a structured fact tied to its row and column labels.
This means "What was goodwill in 2017?" actually finds the table cell, not just text that mentions goodwill.
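As a simplified sketch of turning a markdown table into cell-level facts (the `table_to_facts` helper and its fact-naming scheme are assumptions for illustration, not the production pipeline):

```python
def table_to_facts(markdown_table, entity):
    """Convert a markdown table into (entity, rowlabel_column, value) facts."""
    rows = [[c.strip() for c in line.strip("|").split("|")]
            for line in markdown_table.strip().splitlines()]
    header = rows[0]
    # Drop the |---|---| separator row, keep real data rows
    body = [r for r in rows[1:] if not set("".join(r)) <= set("-: ")]
    facts = []
    for row in body:
        label = row[0]
        for col, value in zip(header[1:], row[1:]):
            facts.append((entity, f"{label.lower()}_{col}", value))
    return facts

table = """
| Item | 2016 | 2017 |
|---|---|---|
| Goodwill | $3,100 | $3,450 |
"""
print(table_to_facts(table, "Acme Corp"))
```

With cells stored as facts, "What was goodwill in 2017?" becomes a direct lookup of `goodwill_2017` rather than a text search.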
Here's a subtle but powerful capability: VRIN doesn't just match entities in your query—it discovers entities in retrieved documents and performs secondary graph traversals.
If you ask about "the choppy website issue" and the retrieved documents mention "Sarah Chen reported the problem," VRIN will discover Sarah Chen and find additional facts about her involvement.
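The secondary-traversal idea can be sketched as a two-step lookup: scan retrieved text for known entities that weren't in the original query, then pull their facts from the graph. The function and the toy graph below are illustrative assumptions, not VRIN internals.

```python
def secondary_traversal(graph, retrieved_text, known_entities):
    """Discover entities mentioned in retrieved text and fetch their facts."""
    discovered = [e for e in graph if e in retrieved_text and e not in known_entities]
    extra_facts = {e: graph[e] for e in discovered}
    return discovered, extra_facts

graph = {
    "Sarah Chen": [("reported", "choppy website issue"), ("team", "frontend")],
    "Acme Corp": [("ceo", "Jane Rivera")],
}
text = "Sarah Chen reported the problem on Tuesday."
found, extra = secondary_traversal(graph, text, known_entities={"Acme Corp"})
print(found)  # ['Sarah Chen']
```

The query never mentioned Sarah Chen, but the traversal surfaces her related facts, which is what lets follow-up reasoning use her involvement.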
We believe transparency builds trust. Here's where VRIN has room to improve:
Our markdown table detection handles standard formats well, but complex nested tables or unusual layouts can trip it up. We're actively improving this with a dedicated development track.
Queries with 3+ simultaneous constraints sometimes miss edge cases. "Find all Q4 2023 revenue figures for tech companies mentioned in both WSJ and Bloomberg" can get complex.
Tables with 50+ rows can exceed optimal chunk sizes. We're working on hierarchical table processing.
Our tests sample 28% of each benchmark dataset (~670 questions), providing ±3.2% margin of error at 95% confidence following BetterBench guidelines. Results plateaued across multiple runs with reproducible seed (42).
These benchmark results translate to real capabilities for reasoning over provided documents:
97.5% accuracy on FinQA (Oracle context) means VRIN can reason accurately over financial reports when you provide the relevant documents. Due diligence, earnings analysis, regulatory compliance—tasks where numerical precision matters.
82.6% on MultiHop-RAG means your AI can synthesize information across related documents the way a research analyst would. Legal discovery, competitive intelligence, technical documentation—anywhere answers span multiple sources you've ingested.
These results use "Oracle context"—the relevant documents are provided for each question. In production, you'd ingest your document corpus and VRIN retrieves relevant content. The benchmark measures reasoning capability; production performance also depends on retrieval quality from your specific corpus.
Our benchmark scripts, raw results, and evaluation logs are fully open source:
```shell
# Clone the benchmark repository
git clone https://github.com/Vrin-cloud/vrin-benchmarks
cd vrin-benchmarks

# Install dependencies
pip install -r requirements.txt

# Run FinQA benchmark (requires VRIN API key)
python run_finqa_benchmark.py --sample 100

# Run MultiHop-RAG benchmark
python run_multihop_benchmark.py --sample 100

# View our actual results
ls results/   # Contains raw JSON logs from our runs
```
What's in the repo:
- `run_finqa_benchmark.py` - FinQA evaluation script
- `run_multihop_benchmark.py` - MultiHop-RAG evaluation script
- `README.md` - Detailed methodology documentation

The evaluation logic is straightforward: no hidden post-processing or result filtering. Every answer is logged with the expected response for full transparency.
Benchmarks are useful, but your documents are what matter. We encourage you to run the benchmark scripts yourself and then test VRIN on your own data.
If VRIN doesn't work for your use case, we'd rather know that than have you waste time.
We're continuing to push on several fronts, including complex table handling, multi-constraint query resolution, and hierarchical processing for large tables.
VRIN's performance on RAGBench FinQA (97.5%) and MultiHop-RAG (82.6%) demonstrates that the hybrid approach—knowledge graphs plus vector search—delivers superior reasoning capability on Oracle context benchmarks.
This isn't about having a better language model. It's about smarter extraction, structured knowledge, and systems that reason across documents the way humans do. Our entity-centric extraction and temporal disambiguation enable accurate numerical reasoning that even 70B parameter models struggle with.
What these results mean: When given the right documents, VRIN extracts and reasons over financial data more accurately than leading LLMs. The next frontier is combining this reasoning capability with robust retrieval for real-world deployments.
The code is open. The methodology is documented. The results are reproducible. We're betting that transparency about what works (and what doesn't yet) is more valuable than marketing claims.
Questions about methodology or want to discuss results? Open an issue on GitHub or contact us at vrin.cloud
Building the next generation of enterprise AI memory at VRIN. We believe in transparent research and open benchmarks.