The Right Way to Feed Your Codebase to an LLM

April 25, 2026 · Gagan Kalra

There's been a lot of hype about code-review-graph, an open-source MCP server that parses your codebase with Tree-sitter, builds a structural knowledge graph in local SQLite, and exposes 22 tools for AI coding agents. The pitch: point your LLM at exactly the files it needs instead of letting it guess across thousands. Claimed token savings: 6-49x.

I wanted to know if the hype matched reality. So I tested it across 9 production repos using an agentic OpenCode workflow, and ran a head-to-head comparison against traditional AI code exploration on 14 dimensions. The results surprised me.

The Setup

Repos tested: 9 repositories from a real multi-repo platform, spanning Go, TypeScript, Ruby, and PHP:

Repo	Language	Size	Description
Go monorepo	Go	6,734 files	Primary backend — DDD architecture, 20+ domains
Next.js frontend	TypeScript	733 files	Customer-facing portal
Rails service	Ruby	~500 files	Bid management + Kafka consumers
PHP monolith	PHP	8,394 files	Legacy system (deprecating)
Video/voice app	TypeScript	355 files	Real-time communication (Turborepo)
PHP REST API	PHP	885 files	Product catalog API (Silex)
PHP admin API	PHP	228 files	Admin backend (Laravel Lumen)
React admin UI	TypeScript	454 files	Admin frontend (React 16, Redux-Saga)
Micro-frontend platform	TypeScript	465 files	5 micro-frontends (Turborepo, Next.js)

After building graphs with post-processing: 148,000 nodes, 670,000 edges, 13,786 communities, 131 execution flows.

The comparison: I ran two approaches against all 9 repos:

Explorer method: 9 parallel AI agents reading source files, configs, READMEs, routes, and schemas — the way most AI coding tools work today
Graph method: Querying the code-review-graph MCP server for structural data — communities, flows, architecture health, blast radius

The Bug I Found Before I Could Even Compare

First attempt: I built all graphs via CLI (code-review-graph build), started the MCP server, and queried. Communities: 0. Flows: 0. FTS search: 0. Everything empty.

Turns out, the CLI build command and the MCP build_or_update_graph_tool share the parse step but otherwise run different pipelines:

Post-processing step	CLI `build`	MCP tool
Tree-sitter parse (nodes + edges)	✓	✓
Compute signatures	✗	✓
Rebuild FTS index	✗	✓
Trace execution flows	✗	✓
Detect communities	✗	✓

The CLI only parses files. The MCP tool parses files AND runs 4 post-processing steps that populate the communities, flows, and search tables. Without those steps, most of the interesting tools return nothing.

This affects the CLI build, update, and watch commands, and even the bundled Claude Code hooks (which call CLI update). I wrote a workaround (a build script that calls the post-processing steps via Python after each CLI build) and filed an issue.
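
A quick way to spot this symptom is to count rows in every table of the graph database after a build. The sketch below is a generic SQLite check, not part of code-review-graph: it assumes only that the graph lives in a local SQLite file, and the path you pass in is something you have to locate yourself. It makes no assumptions about the tool's table names.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Quote each name so unusual identifiers do not break the query.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


def empty_tables(db_path: str) -> list[str]:
    """List tables with zero rows, the symptom of skipped post-processing."""
    counts = table_row_counts(db_path)
    return sorted(name for name, count in counts.items() if count == 0)
```

If empty_tables() reports the tables that back communities, flows, or search right after a CLI build, the post-processing steps probably did not run.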

The Head-to-Head Results

Cost & Speed

Metric	Explorer	Graph	Winner
Total tokens	1,350,000	152,000	Graph (8.9x fewer)
API cost	$11.60	$7.18	Graph (1.6x cheaper)
Wall clock time	5m 45s	3m	Graph (1.9x faster)
Avg tokens per repo	~104K	~17K	Graph (6x fewer)

Graph is significantly cheaper in tokens. But cheaper doesn't mean better — it depends on what you get back.
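
The ratios in the Winner column follow directly from the three total rows. A minimal sketch of that arithmetic, using only the numbers from the table (the dictionary keys are my own labels):

```python
# Totals copied from the Cost & Speed table.
explorer = {"tokens": 1_350_000, "cost_usd": 11.60, "seconds": 5 * 60 + 45}
graph = {"tokens": 152_000, "cost_usd": 7.18, "seconds": 3 * 60}

# How many times larger the explorer figure is than the graph figure.
ratios = {key: round(explorer[key] / graph[key], 1) for key in explorer}
print(ratios)  # {'tokens': 8.9, 'cost_usd': 1.6, 'seconds': 1.9}
```

Note that the cost gap (1.6x) is much narrower than the token gap (8.9x), so token counts alone overstate the savings.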

The 14-Dimension Scorecard

I scored both methods across 14 dimensions on a 0-10 scale:

Dimension	Explorer	Graph	Gap
Tech stack & versions	10	2	Explorer +8
API surface	10	0	Explorer +10
Database schemas	9	0	Explorer +9
Messaging (Kafka/SNS)	9	0	Explorer +9
External dependencies	10	0	Explorer +10
Infrastructure	9	0	Explorer +9
Auth flows	9	3	Explorer +6
Business logic	9	0	Explorer +9
Deprecation awareness	9	1	Explorer +8
Code structure	8	9	Graph +1
Complexity hotspots	7	9	Graph +2
Architectural health	5	9	Graph +4
Test coverage map	6	8	Graph +2
Execution flow tracing	0	8	Graph +8

TOTAL	110/140 (78.6%)	49/140 (35.0%)

Explorer wins 78.6% to 35.0%. Not close.

But that headline number misses the point. Look at where each method scored 0.
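
The totals and the zero-score dimensions can be recomputed from the scorecard. This sketch only restates the table; the variable names are mine.

```python
# (explorer, graph) scores copied from the scorecard, 0-10 scale.
scores = {
    "Tech stack & versions": (10, 2),
    "API surface": (10, 0),
    "Database schemas": (9, 0),
    "Messaging (Kafka/SNS)": (9, 0),
    "External dependencies": (10, 0),
    "Infrastructure": (9, 0),
    "Auth flows": (9, 3),
    "Business logic": (9, 0),
    "Deprecation awareness": (9, 1),
    "Code structure": (8, 9),
    "Complexity hotspots": (7, 9),
    "Architectural health": (5, 9),
    "Test coverage map": (6, 8),
    "Execution flow tracing": (0, 8),
}

max_total = 10 * len(scores)
explorer_total = sum(e for e, _ in scores.values())
graph_total = sum(g for _, g in scores.values())

# Dimensions where a method produced nothing usable.
explorer_zeros = [dim for dim, (e, _) in scores.items() if e == 0]
graph_zeros = [dim for dim, (_, g) in scores.items() if g == 0]

print(f"Explorer: {explorer_total}/{max_total} ({100 * explorer_total / max_total:.1f}%)")
print(f"Graph:    {graph_total}/{max_total} ({100 * graph_total / max_total:.1f}%)")
print("Explorer zeros:", explorer_zeros)
print("Graph zeros:", graph_zeros)
```

The explorer has a single zero (execution flow tracing) and the graph has six, all of them dimensions that live outside the code's structure.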

What Only the Graph Found

These are things the file-reading agents never surfaced:

One frontend had 77 architectural coupling warnings with a coupling ratio of 8.8 — by far the most structurally unhealthy repo. The explorer called it “well-organized.”
A React admin UI had a component with criticality score 0.92 — the single most critical execution flow across all 9 repos. The explorer described it as just another feature.
A video call component was 4,295 lines — an extreme complexity hotspot that nobody had flagged.
The Go monorepo had a coupling ratio of 0.04 — quantitative proof that its DDD domain architecture was actually working. The explorer described the architecture qualitatively, but couldn’t put a number on it.
Two deprecated PHP repos had exactly 0 cross-community edges — they’re structurally dead. Useful data for a decommission decision.
131 execution flows across 8 repos with depth and criticality scores — entirely invisible to file-reading agents.
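
To make the coupling numbers above concrete, here is one plausible way such a ratio could be computed. This is an assumption for illustration only: I have not confirmed the formula code-review-graph uses, and the sketch simply divides edges that cross community boundaries by edges that stay inside one community.

```python
def coupling_ratio(
    edges: list[tuple[str, str]],
    community_of: dict[str, int],
) -> float:
    """Cross-community edges divided by intra-community edges.

    ASSUMED definition, not taken from code-review-graph itself.
    """
    cross = 0
    intra = 0
    for src, dst in edges:
        if community_of[src] == community_of[dst]:
            intra += 1
        else:
            cross += 1
    if intra == 0:
        # No internal edges at all: undefined, reported as infinity if
        # anything crosses a boundary, otherwise zero.
        return float("inf") if cross else 0.0
    return cross / intra


# Hypothetical toy graph: nodes a, b, c share a community, d sits in another.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1}
print(coupling_ratio(edges, community_of))  # 0.5
```

Under this reading, a ratio near 0 means calls mostly stay inside their own community, and a ratio well above 1 means most calls cross boundaries.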

What Only the Explorer Found

These are things a structural graph will never capture:

The Go monorepo has 20 named DDD domains, each with specific purpose, database assignments, and Kafka roles
One service is mid-migration — storage moving from monolith to a new microservice, gated by a feature flag
A frontend calls 8 distinct backend services via Server Actions
The legacy monolith runs 5 web servers on ports 80-84 serving different subdomains
A PHP API has 6 sub-applications each on a different port
One admin UI bootstraps auth from a PHP iframe session — tokens for 3 APIs
A micro-frontend platform has dual auth systems — modern OAuth for new routes, legacy session cookies for iframe routes
12 database schemas across services, plus DynamoDB and Snowflake
12 SNS topics and 10 Kafka consumer topics with local table targets
A complete cross-repo platform architecture diagram — which services talk to which, through what protocols

Where the Graph Broke Down

PHP support is broken. The PHP monolith had 52,000 nodes parsed, but every detected flow was in vendored TinyMCE JavaScript, with zero useful PHP application flows. The tool's CALLS edge extraction for PHP produced near-zero results (22 edges in one PHP repo, 0 in another).
No config/infrastructure awareness. Blind to Helm charts, Docker configs, CI/CD pipelines, environment variables. The graph lives entirely in source code ASTs.
No semantic understanding. It knows jwt is a function. It doesn’t know that function validates Okta tokens via OIDC.
Massive output sizes. Architecture overviews for larger repos were 1-9 MB of JSON — too large for an LLM context window. I had to delegate to a summarization agent just to make them usable.
No cross-repo awareness. Each repo is an isolated graph. Can’t trace a call from the backend to the frontend.
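
For the output-size problem in particular, a cheap pre-filter can shrink a large JSON payload before any model sees it. The sketch below is not the summarization agent described above, and it assumes only that the overview is JSON whose bulk comes from long arrays; the "communities" key in the usage line is hypothetical.

```python
import json
from typing import Any


def trim_json(value: Any, max_items: int = 5) -> Any:
    """Recursively keep only the first max_items entries of every list."""
    if isinstance(value, dict):
        return {key: trim_json(item, max_items) for key, item in value.items()}
    if isinstance(value, list):
        kept = [trim_json(item, max_items) for item in value[:max_items]]
        omitted = len(value) - len(kept)
        if omitted > 0:
            # Leave a marker so the reader knows the list was cut.
            kept.append(f"... {omitted} more items omitted")
        return kept
    return value


# Hypothetical payload shape, for illustration only.
overview = {"communities": list(range(10))}
print(json.dumps(trim_json(overview, max_items=3)))
```

Truncation loses information that a summarizer would keep, so this fits best as a first pass to check whether the full payload is worth processing.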

When to Use Which

Scenario	Best Method	Why
“What does this repo do?”	Explorer	Business logic, APIs, features, purpose
“How healthy is the architecture?”	Graph	Coupling ratios, warnings, community boundaries
“What tech stack?”	Explorer	Versions, frameworks, packages
“Where are the complexity hotspots?”	Graph	Line counts, criticality scores, flow depths
“What databases/services does it use?”	Explorer	Config parsing, connection strings, env vars
“Which execution paths are most critical?”	Graph	Flow criticality scoring
“Is this change safe?” (blast radius)	Graph	Impact radius, affected flows, affected communities
“Onboard a new engineer”	Explorer	Full context — what, why, how, where
“Prioritize refactoring targets”	Graph	Coupling ratios + warning counts = data-driven priorities
“Full architecture review”	Both	Graph for structure + Explorer for semantics

The Optimal Strategy

Neither method alone gives you the full picture. The best approach I found:

Step 1: Graph first (30s, ~30K tokens, ~$0.18)
  → Structural baseline: community map, coupling ratios, flows, hotspots
  → Identify which repos need deep exploration and where to focus

Step 2: Targeted exploration (5m, ~933K tokens, ~$3.80)
  → Pre-load graph findings as context for each explorer agent
  → "This repo has 77 coupling warnings — find what's causing them"
  → Skip repos with 0 cross-community edges (confirmed dead)

Step 3: Synthesize
  → Merge structural precision with semantic richness
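
The hand-off from Step 1 to Step 2 can be sketched as a small triage function. Everything here is illustrative: RepoMetrics and plan_exploration are hypothetical names, and apart from the 77 warnings and the 0 cross-community edges mentioned above, the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class RepoMetrics:
    """Per-repo structural findings from the graph pass (hypothetical shape)."""
    name: str
    coupling_warnings: int
    cross_community_edges: int


def plan_exploration(repos: list[RepoMetrics]) -> list[tuple[str, str]]:
    """Return (repo name, explorer prompt) pairs, worst coupling first."""
    plan = []
    for repo in sorted(repos, key=lambda r: r.coupling_warnings, reverse=True):
        if repo.cross_community_edges == 0:
            # Structurally dead according to the graph: skip deep exploration.
            continue
        prompt = (
            f"Graph findings for {repo.name}: "
            f"{repo.coupling_warnings} coupling warnings, "
            f"{repo.cross_community_edges} cross-community edges. "
            "Find what is causing them."
        )
        plan.append((repo.name, prompt))
    return plan


# Sample inputs; only 77 and 0 come from the findings above.
repos = [
    RepoMetrics("backend", coupling_warnings=4, cross_community_edges=15),
    RepoMetrics("frontend", coupling_warnings=77, cross_community_edges=120),
    RepoMetrics("legacy-php", coupling_warnings=3, cross_community_edges=0),
]
for name, prompt in plan_exploration(repos):
    print(name, "->", prompt)
```

Sorting by warning count puts the explorer budget where the graph says the structure is weakest.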

The graph tells you WHERE to look. Exploration tells you WHAT’s there. Together they’re stronger than either alone.

My Verdict

code-review-graph is a precision tool for structural analysis, not a general-purpose context replacement. It genuinely shines at architectural health metrics, complexity detection, and blast radius analysis. On the structural dimensions I scored, file-reading agents trailed (5 to 8 out of 10 against 8 to 9 for the graph), and they could not trace execution flows at all.

But it scored 35% against the explorer's 79% in the overall comparison. It's blind to APIs, databases, messaging, infrastructure, and business logic, and weak on auth flows, which together make up most of what matters for daily development work.

I’m keeping it on my machine for code reviews and refactoring prioritization. The blast radius and coupling ratio features alone justify the setup. But I wouldn’t pitch it to a team as a productivity game-changer — it’s a specialized tool with a specific sweet spot.

Setup tip: Unless your agent builds the graph through the MCP tool, you'll hit the post-processing bug. That includes the CLI build, update, and watch commands and the bundled Claude Code hooks. After code-review-graph build, call build_or_update_graph_tool(full_rebuild=True) via the MCP server, or add the post-processing steps manually. Otherwise you'll get empty results for the most interesting features.
