Apr 9, 2026·research

Karpathy Is Right About LLM Knowledge Bases. Here's What Happens When You Stress-Test the Idea.

We ran the same strategic question through two workflows: a local filesystem agent reading files directly vs Vrin's graph-aware retrieval. Same AI, same documents, same question. Here's what we found.

Vedant Patel·10 min read

Andrej Karpathy's recent tweet about LLM Knowledge Bases struck a nerve because it names something engineers have been feeling for a while: we've been building smarter models and better orchestration, but the thing that determines whether an AI agent gives you a good answer, the knowledge it reasons over, is still an afterthought.

He describes the ideal architecture as a "small cognitive core" that fetches information as needed from vast external knowledge. Not a model that loads everything into context. Not a bigger context window. A compact reasoning engine backed by structured, retrievable knowledge.

We wanted to test that intuition with real data. So we ran an experiment.

The setup

Our research directory contained 30 files: 20 research papers (DeepSeek-R1, HippoRAG 2, Self-RAG, CRAG, Adaptive RAG, Microsoft GraphRAG, and others), 4 influential blog posts (Anthropic's "Building Effective Agents", "Contextual Retrieval", Berkeley's "Compound AI Systems", and "Claude's Character"), and 6 supplementary files (our own research notes, outlines, and summaries written while studying the papers). We ingested all 30 into Vrin.

We asked both workflows the same question:

"We've given AI agents tools, planning, and reasoning capabilities. They're still unreliable in production. Based on the latest research, what is the single most underinvested layer in the current AI agent stack, and what evidence supports that?"

This is the kind of question a technical founder asks before deciding what to build next. It requires cross-document reasoning. No single paper has the answer. You need to connect benchmark results from one source with evaluation data from another, memory systems research with agent framework analysis.

Workflow 1: Local filesystem. A Claude Code agent reads local files directly. It searched across five directories, opened all 30 files, and read 336,000 characters of raw text. It synthesized the answer from everything it read.

Workflow 2: Vrin. Our retrieval system found 148 graph facts and 15 text chunks from the same 30 documents, totaling 28,000 characters of structured context. A separate Claude Code agent synthesized from that retrieved context.

Same AI agent for the final synthesis in both cases. The only difference was how the context was assembled.

Both got the same answer. Vrin found more evidence.

Both workflows concluded that the knowledge and memory infrastructure layer is the most underinvested part of the AI agent stack. Both arrived at the same strategic direction.

But Vrin's response cited 7 distinct evidence threads across the research corpus to the filesystem's 5, including every key insight the filesystem found plus additional cross-paper connections.

What the filesystem found

The local workflow read all 30 files cover-to-cover and produced a strong, well-sourced answer. It cited:

  • REMem's +27.2% improvement from hybrid memory over episodic-only
  • MAGMA's thesis that "vectors do not inherently encode relationships"
  • The lack of structured memory in any major agent framework (0 of 6)
  • R3-RAG's +15pp improvement from RL-trained retrieval
  • The 1,250x cost advantage of retrieval over long context

This is what you get from careful, thorough reading of every relevant file.

What Vrin found

Vrin's response included everything the filesystem found, plus evidence the filesystem missed entirely:

REMem (+27.2%) and MAGMA ("vectors do not inherently encode relationships"). Both present. Vrin's knowledge graph connected these papers to the query through the agent memory community cluster.

Agent-as-a-Judge: the hard numbers on agent failure. The Meta AI paper (October 2024) tested MetaGPT, GPT-Pilot, and OpenHands on 55 realistic development tasks. The best agents satisfied only 29% of task requirements. Task solve rates: 0.00% to 1.81%. The filesystem never found this despite the paper being in the corpus.

Block/Dorsey's "World Model" manifesto. Jack Dorsey and Roelof Botha's essay describing the need for "AI systems that maintain structured, queryable representations of organizational knowledge." A $40B company articulating the exact infrastructure gap. The filesystem didn't connect this to the agent reliability question.

BAIR Compound AI Systems. Berkeley's argument that "the highest-quality AI results come from compound systems with multiple components, not just monolithic models." This frames the knowledge layer as a first-class architectural concern.

MuSiQue benchmark progression. Iterative reasoning improved Exact Match from 0.377 to 0.469 (+24%), with knowledge consolidation pushing it further to 0.478. The gains came entirely from pre-inference context assembly, not model improvements.

Cross-paper connections. Vrin traced connections between REMem's memory findings, MAGMA's architectural diagnosis, the Agent-as-a-Judge evaluation gap, the BAIR compound systems thesis, and Karpathy's "small cognitive core" architecture. These connections exist in the knowledge graph as entity relationships between papers that were never written with each other in mind.

12x less text, broader evidence

Metric                        | Filesystem               | Vrin
Text input to the AI agent    | 336K chars (30 files)    | 28K chars (148 facts + 15 chunks)
Evidence threads cited        | 5                        | 7
Research papers connected     | Read individually        | Connected via entity relationships
Cross-paper connections       | Manual synthesis         | Pre-computed graph traversal

Vrin read 12x less text and found broader, more connected evidence. The filesystem read more but synthesized in isolation: each paper understood on its own, connections made by the LLM at generation time. Vrin's knowledge graph had already encoded the relationships between papers at ingestion time, so the retrieval could traverse from REMem's memory findings to MAGMA's architectural diagnosis to the Agent-as-a-Judge evaluation data in a single graph walk.
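
To make "encoded at ingestion time" concrete, here is a minimal sketch of the general idea, not Vrin's implementation. The entity labels attached to each paper are hand-written placeholders (a real pipeline would extract them), and `build_entity_index` and `linked_documents` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical entity annotations; a real system would extract these at ingestion.
DOC_ENTITIES = {
    "REMem": {"agent memory", "hybrid memory"},
    "MAGMA": {"agent memory", "vector retrieval"},
    "Agent-as-a-Judge": {"agent evaluation", "agent reliability"},
    "BAIR Compound AI Systems": {"agent reliability", "compound systems"},
}

def build_entity_index(doc_entities):
    """Map each entity to the set of documents that mention it."""
    index = defaultdict(set)
    for doc, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc)
    return index

def linked_documents(index):
    """Return document pairs that share an entity, with the shared entities."""
    links = defaultdict(set)
    for entity, docs in index.items():
        for a, b in combinations(sorted(docs), 2):
            links[(a, b)].add(entity)
    return dict(links)

index = build_entity_index(DOC_ENTITIES)
links = linked_documents(index)
for pair, shared in sorted(links.items()):
    print(pair, "share", sorted(shared))
```

Because the links are computed once when documents are added, a query only has to follow them; it does not have to rediscover them from raw text.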

What about standard RAG?

We compared against the filesystem because it represents the upper bound: every word in every file, unlimited context. But in production, most AI systems don't get the filesystem treatment. They use standard RAG pipelines: embed documents into vectors, retrieve by cosine similarity, pass the top-K chunks to an LLM.
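
For readers who have not built one, the retrieval step of such a pipeline can be sketched as follows. The three-dimensional vectors are toy placeholders standing in for real embeddings, and `cosine` and `top_k` are illustrative names rather than any library's API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k):
    """Rank (chunk_id, vector) pairs by similarity to the query and keep k."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunks]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Placeholder vectors; a real pipeline would get these from an embedding model.
CHUNKS = [
    ("chunk-a", [0.9, 0.1, 0.0]),
    ("chunk-b", [0.7, 0.7, 0.0]),
    ("chunk-c", [0.0, 0.2, 0.9]),
    ("chunk-d", [0.1, 0.9, 0.1]),
]
query = [1.0, 0.0, 0.0]
print(top_k(query, CHUNKS, k=2))
```

Each chunk is scored against the query on its own, and everything below the cutoff is discarded before the LLM sees anything.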

Standard RAG would likely handle this query worse than the filesystem did.

The core problem is that cosine similarity ranks chunks by how closely they resemble the query text, not by whether they contribute evidence toward the answer. When you ask about "the most underinvested layer in the AI agent stack," a vector search favors chunks that discuss investment, agents, and stacks in similar terms. It has no mechanism to know that Agent-as-a-Judge (a paper about evaluation methodology) or Block's "From Hierarchy to Intelligence" (a manifesto about organizational knowledge) contains critical evidence, because those documents use entirely different vocabulary to describe the same gap.

No cross-document reasoning. RAG retrieves isolated chunks, not connections between them. REMem's finding about hybrid memory and MAGMA's thesis about vector limitations appear in separate papers. A domain expert connects these because they diagnose the same architectural problem from different angles. Standard RAG retrieves them independently (if at all) and leaves the connection-making entirely to the LLM, which may or may not notice the relationship buried in a stack of disconnected text fragments.

Cosine similarity cannot capture query nuance. "What is the most underinvested layer" is a strategic, evaluative question. It requires judgment across evidence, not similarity matching. The relevant chunks don't share the query's keywords. They contain benchmark numbers, architectural critiques, and framework comparisons that collectively point toward an answer. Vector similarity scores each chunk against the query in isolation, so it is structurally blind to this kind of conceptual convergence.

The top-K ceiling. Standard RAG retrieves the top 5, 10, maybe 20 chunks by similarity score. For a question that requires synthesizing evidence from 7 different papers, most of the relevant chunks won't survive the top-K cutoff. The retrieval is fundamentally lossy for multi-source questions. The filesystem avoids this by reading everything. Vrin avoids it by traversing the knowledge graph to find connected evidence regardless of surface similarity.
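
The cutoff effect is easy to see with a toy ranking. The list below is invented for illustration and is not data from our experiment: it assumes a few papers dominate the top of the ranking, which is one common way a multi-source question loses coverage. `sources_covered` is a hypothetical helper.

```python
def sources_covered(ranked_chunks, k):
    """Distinct source documents among the first k ranked chunks."""
    return {source for source, _ in ranked_chunks[:k]}

# Hypothetical ranking: (source_document, chunk_number), best match first.
RANKED = [
    ("paper-1", 0), ("paper-1", 1), ("paper-2", 0), ("paper-1", 2),
    ("paper-2", 1), ("paper-3", 0), ("paper-1", 3), ("paper-2", 2),
    ("paper-3", 1), ("paper-4", 0), ("paper-5", 0), ("paper-4", 1),
    ("paper-6", 0), ("paper-7", 0),
]

for k in (5, 10, 14):
    print(f"top-{k}: {len(sources_covered(RANKED, k))} of 7 papers represented")
```

In this made-up ranking, a top-5 cutoff reaches two of the seven papers and a top-10 cutoff reaches four. Raising K helps, but only by admitting more loosely related chunks along with the useful ones.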

The filesystem represents the best case without structured knowledge: read it all, hope the LLM connects the dots. Standard RAG represents what production systems actually do: retrieve a handful of similar-looking fragments and hope for the best. Vrin operates in a different category: structured retrieval with retrieval-time reasoning, where the system understands what it's looking for before it searches and traces connections the query text never mentioned.

What your AI actually reasons over

Standard retrieval hands an LLM a pile of text fragments and says "figure it out." Vrin delivers structured understanding.

The filesystem gave the LLM raw document text. Research papers in their original form, scattered across files, no explicit connections between them. The LLM had to do all the intellectual heavy lifting: figure out which papers relate to each other, notice that REMem and MAGMA diagnose the same architectural gap from different angles, realize that Agent-as-a-Judge's evaluation findings explain why that gap goes undetected in production. Even frontier models miss those connections when the evidence is scattered across unstructured text with no map telling them where to look.

Vrin delivers a fundamentally different kind of context. Not text fragments retrieved by similarity. Structured intelligence, where every piece of evidence has been reasoned over, connected, and organized before the LLM generates a single token.

Evidence organized by meaning, not by source. The LLM doesn't see "here are chunks from Paper A, Paper B, Paper C." It sees "here is everything your knowledge base knows about agent memory architectures, here is everything about evaluation methodology, here is everything about knowledge infrastructure gaps." The context is organized by what it means, not which document it came from. The model knows how many knowledge clusters the evidence spans and is guided to reason across all of them. It can't accidentally fixate on one paper and miss the others.
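
A rough sketch of what "organized by meaning" could look like when the context is assembled. The cluster labels, the `Fact` type, and both function names are assumptions made for this example; the fact texts are taken from the findings quoted earlier in this post.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    text: str
    source: str
    cluster: str  # hypothetical cluster label assigned upstream

FACTS = [
    Fact("Hybrid memory improves on episodic-only memory by 27.2%", "REMem", "agent memory"),
    Fact("Vectors do not inherently encode relationships", "MAGMA", "agent memory"),
    Fact("Best agents satisfied only 29% of task requirements", "Agent-as-a-Judge", "evaluation"),
]

def group_by_cluster(facts):
    """Group facts by topic cluster instead of by source document."""
    grouped = defaultdict(list)
    for fact in facts:
        grouped[fact.cluster].append(fact)
    return dict(grouped)

def render_context(grouped):
    """Render a briefing that states the cluster count, then lists facts per cluster."""
    lines = [f"Evidence spans {len(grouped)} clusters."]
    for cluster, facts in grouped.items():
        lines.append(f"[{cluster}]")
        lines.extend(f"- {f.text} (source: {f.source})" for f in facts)
    return "\n".join(lines)

grouped = group_by_cluster(FACTS)
context = render_context(grouped)
print(context)
```

The source of each fact is still recorded, but it is an attribute of the fact rather than the organizing principle of the prompt.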

Connections between documents are already established. Before the LLM generates a single word, Vrin has already traced how findings from different papers relate to each other. "REMem's hybrid memory results and MAGMA's vector limitation diagnosis converge on the same architectural gap" isn't something the LLM has to discover in a pile of text. It arrives as an established insight. The LLM synthesizes from a structured briefing where the connections have already been made, not from a stack of fragments where they might be hiding.

The reasoning is done before the generation starts. Vrin breaks complex queries into dependency-ordered sub-questions, retrieves targeted evidence for each, evaluates confidence at every step, and delivers the full chain of findings as a structured reasoning path. The LLM follows a pre-built argument, not a scavenger hunt through documents hoping to stumble on the right thread.
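
Dependency ordering of this kind is a topological sort. The sketch below shows only that ordering step, under stated assumptions: the four sub-questions are a made-up decomposition of our query, `stub_retrieve` is a placeholder for real retrieval, and confidence evaluation is left out entirely.

```python
from graphlib import TopologicalSorter

# Hypothetical decomposition: each sub-question maps to the ones it depends on.
SUB_QUESTIONS = {
    "which_layers_exist": set(),
    "where_do_agents_fail": set(),
    "which_layer_explains_failures": {"which_layers_exist", "where_do_agents_fail"},
    "what_evidence_supports_it": {"which_layer_explains_failures"},
}

def answer_in_order(sub_questions, retrieve):
    """Resolve sub-questions so each one sees the findings of its dependencies."""
    findings = {}
    for question in TopologicalSorter(sub_questions).static_order():
        prior = {dep: findings[dep] for dep in sub_questions[question]}
        findings[question] = retrieve(question, prior)
    return findings

def stub_retrieve(question, prior):
    # Placeholder: a real retriever would use the prior findings to target its search.
    return f"{question} <- {sorted(prior)}"

findings = answer_in_order(SUB_QUESTIONS, stub_retrieve)
order = list(findings)
for step, question in enumerate(order, start=1):
    print(step, findings[question])
```

The resulting `findings` dictionary preserves the order in which questions were resolved, which is what lets it be handed to the LLM as a reasoning path rather than a bag of results.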

Every fact carries provenance and context. When Vrin discovers a relevant finding through multi-hop traversal, three relationships away from the original query entities, the LLM knows exactly how that evidence was found and why it's relevant. This is how the Agent-as-a-Judge paper surfaced in response to a question about "underinvested layers." Not through keyword overlap. Through entity relationships that Vrin's knowledge graph had already mapped across the entire corpus.
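
One simple way to carry that provenance is to return the traversal path along with the result. This is a toy breadth-first search over a hand-built graph, not Vrin's traversal: the edges are hypothetical, `find_with_provenance` is an illustrative name, and the default `max_hops=3` mirrors the "three relationships away" case described above.

```python
from collections import deque

# Hypothetical entity graph; edges are illustrative, not extracted from the corpus.
GRAPH = {
    "agent reliability": ["task solve rate"],
    "task solve rate": ["MetaGPT", "GPT-Pilot"],
    "MetaGPT": ["Agent-as-a-Judge"],
    "GPT-Pilot": ["Agent-as-a-Judge"],
    "Agent-as-a-Judge": [],
}

def find_with_provenance(graph, start, target, max_hops=3):
    """Breadth-first search returning the path used to reach target, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        if len(path) - 1 >= max_hops:
            continue  # hop budget spent; do not expand further
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = find_with_provenance(GRAPH, "agent reliability", "Agent-as-a-Judge")
print(" -> ".join(path))
```

The returned path doubles as the explanation: it records which relationships led from the query's entities to the evidence.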

The filesystem gave the LLM documents. Vrin gave it understanding.

The scaling question

At 30 documents, both workflows got the right answer. The filesystem was thorough on the papers it read.

But consider what happens as the corpus grows:

  • At 30 documents, an AI agent can read everything. The filesystem works. Standard RAG retrieves a few similar chunks and already misses evidence. Karpathy's LLM Knowledge Base pattern is effective here, but only with filesystem-level access.
  • At 1,000 documents, the filesystem agent can't read them all. It has to guess which files matter. It will miss connections between documents it never opens together. Standard RAG drowns in noise, retrieving surface-similar chunks from hundreds of irrelevant documents. Both approaches break.
  • At 10,000+ documents, only structured retrieval with reasoning scales. The knowledge graph's pre-computed entity relationships become the only way to find the 15 documents (out of 10,000) that contain the connected evidence needed for a cross-domain strategic question.
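
A back-of-the-envelope projection shows why reading everything stops being an option. It uses the two character counts from this experiment and one loud assumption: that average document length stays at what we saw in our 30-file corpus.

```python
CORPUS_CHARS = 336_000    # characters the filesystem workflow read (this experiment)
CORPUS_DOCS = 30
RETRIEVED_CHARS = 28_000  # characters of context Vrin retrieved (this experiment)

# Assumption: larger corpora keep the same average document length.
avg_chars_per_doc = CORPUS_CHARS / CORPUS_DOCS

def read_everything_chars(num_docs, avg_chars=avg_chars_per_doc):
    """Characters an agent must read to cover a corpus of num_docs documents."""
    return int(num_docs * avg_chars)

for n in (30, 1_000, 10_000):
    print(f"{n:>6} docs -> {read_everything_chars(n):,} chars to read everything")

reduction = CORPUS_CHARS / RETRIEVED_CHARS
print(f"Context reduction in this experiment: {reduction:.0f}x")
```

Under that assumption, a 1,000-document corpus is about 11.2 million characters and a 10,000-document corpus about 112 million.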

Even at 30 documents, the divergence was visible. The filesystem missed the Agent-as-a-Judge findings, the Block/Dorsey manifesto connection, and the BAIR compound systems thesis. Not because they were hard to read, but because nothing in the filenames suggested they were relevant to "underinvested layers in the agent stack." Standard RAG would miss even more, because cosine similarity wouldn't surface those documents either. The knowledge graph found them because the entities were connected to the agent reliability cluster through graph edges, not keyword overlap.

What we actually learned

1. Structured retrieval finds what you didn't know to look for. The Agent-as-a-Judge paper was sitting in a clearly titled file. The filesystem agent's answer left it out, since nothing in the filename suggested relevance to the query. The knowledge graph found it because the entities (MetaGPT, GPT-Pilot, task solve rate) were connected to the agent reliability cluster through graph edges.

2. Less context can mean better answers. The filesystem fed 336K characters to the LLM. Vrin fed 28K. The LLM produced a more connected, better-evidenced answer from 12x less text because every piece of that 28K was curated through graph traversal and confidence scoring rather than handed over as raw text.

3. The bottleneck isn't the model. Both workflows used the same Claude model for synthesis. The difference was entirely in what context each workflow assembled. This is consistent with Karpathy's core thesis: the intelligence is in how you organize and retrieve knowledge, not in the model that reasons over it.

Karpathy is right that LLMs need structured external knowledge. The open question is how that knowledge should be organized, connected, and reasoned over at query time. That's the engineering problem we're working on. And at least for this experiment, the data shows it matters.


This experiment was run in April 2026 against a corpus of 30 files: 20 research papers, 4 blog posts, and 6 supplementary research notes. Both workflows used Claude as the synthesis model. Vrin's retrieval-time reasoning pipeline retrieved 148 structured facts and 15 text chunks across 8 knowledge clusters. The filesystem workflow read all 30 files totaling 336K characters. Full responses and raw context are available on request.

Vedant Patel

Founder & CEO

Building knowledge reasoning infrastructure for enterprise AI at VRIN. We believe in transparent research and open benchmarks.