The Right Way to Feed Your Codebase to an LLM
There's been a lot of hype about code-review-graph, an open-source MCP server that parses your codebase with Tree-sitter, builds a structural knowledge graph in local SQLite, and exposes 22 tools for AI coding agents. The pitch: point your LLM at exactly the files it needs instead of letting it guess across thousands, for claimed token savings of 6-49x.
I wanted to know if the hype matched reality. So I tested it across 9 production repos using an agentic OpenCode workflow, and ran a head-to-head comparison against traditional AI code exploration on 14 dimensions. The results surprised me.
The Setup
Repos tested: 9 repositories spanning a real multi-repo platform, a mix of Go, TypeScript, Ruby, and PHP:
| Repo | Language | Size | Description |
|---|---|---|---|
| Go monorepo | Go | 6,734 files | Primary backend — DDD architecture, 20+ domains |
| Next.js frontend | TypeScript | 733 files | Customer-facing portal |
| Rails service | Ruby | ~500 files | Bid management + Kafka consumers |
| PHP monolith | PHP | 8,394 files | Legacy system (deprecating) |
| Video/voice app | TypeScript | 355 files | Real-time communication (Turborepo) |
| PHP REST API | PHP | 885 files | Product catalog API (Silex) |
| PHP admin API | PHP | 228 files | Admin backend (Laravel Lumen) |
| React admin UI | TypeScript | 454 files | Admin frontend (React 16, Redux-Saga) |
| Micro-frontend platform | TypeScript | 465 files | 5 micro-frontends (Turborepo, Next.js) |
After building graphs with post-processing: 148,000 nodes, 670,000 edges, 13,786 communities, 131 execution flows.
The comparison: I ran two approaches against all 9 repos:
- Explorer method: 9 parallel AI agents reading source files, configs, READMEs, routes, and schemas — the way most AI coding tools work today
- Graph method: Querying the code-review-graph MCP server for structural data — communities, flows, architecture health, blast radius
The Bug I Found Before I Could Even Compare
First attempt: I built all graphs via CLI (code-review-graph build), started the MCP server, and queried. Communities: 0. Flows: 0. Full-text search (FTS) hits: 0. Everything empty.
Turns out, the CLI build command and the MCP build_or_update_graph_tool run completely different pipelines:
| Post-processing step | CLI build | MCP tool |
|---|---|---|
| Tree-sitter parse (nodes + edges) | ✓ | ✓ |
| Compute signatures | ✗ | ✓ |
| Rebuild FTS index | ✗ | ✓ |
| Trace execution flows | ✗ | ✓ |
| Detect communities | ✗ | ✓ |
The CLI only parses files. The MCP tool parses files AND runs 4 post-processing steps that populate the communities, flows, and search tables. Without those steps, most of the interesting tools return nothing.
This affects the CLI build, update, and watch commands, and even the bundled Claude Code hooks (which call CLI update). I wrote a workaround (a build script that calls the post-processing via Python after each CLI build) and filed an issue.
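A quick way to tell whether a build skipped post-processing is to count rows per table in the graph's SQLite file. This is a minimal sketch, not the workaround script itself: the database path is a placeholder, and it assumes nothing about the tool's table names. It simply lists every table and flags the empty ones.

```python
import sqlite3


def table_row_counts(db_path: str) -> dict[str, int]:
    """Return {table_name: row_count} for every user table in a SQLite file.

    Note: sqlite3.connect() creates an empty file if db_path does not exist,
    so point this at a database that a build has already produced.
    """
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Names come straight from sqlite_master, so quoting them is enough.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()


def empty_tables(db_path: str) -> list[str]:
    """Names of tables with zero rows, sorted alphabetically."""
    counts = table_row_counts(db_path)
    return sorted(name for name, count in counts.items() if count == 0)


if __name__ == "__main__":
    # Placeholder path: substitute wherever your build writes its database.
    print(empty_tables("path/to/graph.db"))
```

If the tables backing communities, flows, or search show up in that list right after a CLI build, you have hit the pipeline gap described above.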
The Head-to-Head Results
Cost & Speed
| Metric | Explorer | Graph | Winner |
|---|---|---|---|
| Total tokens | 1,350,000 | 152,000 | Graph (8.9x fewer) |
| API cost | $11.60 | $7.18 | Graph (1.6x cheaper) |
| Wall clock time | 5m 45s | 3m | Graph (1.9x faster) |
| Avg tokens per repo | ~104K | ~17K | Graph (6x fewer) |
Graph is significantly cheaper in tokens. But cheaper doesn't mean better — it depends on what you get back.
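The ratios in the table can be rechecked in a few lines. The figures are the ones reported above; the dictionary layout and function names are only for illustration. The arithmetic also makes one thing visible: the graph run used 8.9x fewer tokens but was only 1.6x cheaper, so each of its tokens cost more on average. The table doesn't say why.

```python
# Totals as reported in the Cost & Speed table.
explorer = {"tokens": 1_350_000, "cost_usd": 11.60, "seconds": 5 * 60 + 45}
graph = {"tokens": 152_000, "cost_usd": 7.18, "seconds": 3 * 60}


def ratios(baseline: dict, candidate: dict) -> dict:
    """How many times larger each baseline metric is than the candidate's."""
    return {key: round(baseline[key] / candidate[key], 1) for key in baseline}


def usd_per_1k_tokens(run: dict) -> float:
    """Average cost of 1,000 tokens for a run."""
    return run["cost_usd"] / (run["tokens"] / 1000)


print(ratios(explorer, graph))
# → {'tokens': 8.9, 'cost_usd': 1.6, 'seconds': 1.9}
print(usd_per_1k_tokens(explorer), usd_per_1k_tokens(graph))
```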
The 14-Dimension Scorecard
I scored both methods across 14 dimensions on a 0-10 scale:
| Dimension | Explorer | Graph | Gap |
|---|---|---|---|
| Tech stack & versions | 10 | 2 | Explorer +8 |
| API surface | 10 | 0 | Explorer +10 |
| Database schemas | 9 | 0 | Explorer +9 |
| Messaging (Kafka/SNS) | 9 | 0 | Explorer +9 |
| External dependencies | 10 | 0 | Explorer +10 |
| Infrastructure | 9 | 0 | Explorer +9 |
| Auth flows | 9 | 3 | Explorer +6 |
| Business logic | 9 | 0 | Explorer +9 |
| Deprecation awareness | 9 | 1 | Explorer +8 |
| Code structure | 8 | 9 | Graph +1 |
| Complexity hotspots | 7 | 9 | Graph +2 |
| Architectural health | 5 | 9 | Graph +4 |
| Test coverage map | 6 | 8 | Graph +2 |
| Execution flow tracing | 0 | 8 | Graph +8 |
| TOTAL | 110/140 (78.6%) | 49/140 (35.0%) | |
Explorer wins 78.6% to 35.0%. Not close.
But that headline number misses the point. Look at where each method scored 0.
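For anyone who wants to recheck the totals or pull out the zeros, the scorecard fits in a small dictionary. The scores are copied from the table; the variable names are just for this sketch.

```python
# (explorer, graph) scores per dimension, copied from the scorecard.
scores = {
    "Tech stack & versions": (10, 2),
    "API surface": (10, 0),
    "Database schemas": (9, 0),
    "Messaging (Kafka/SNS)": (9, 0),
    "External dependencies": (10, 0),
    "Infrastructure": (9, 0),
    "Auth flows": (9, 3),
    "Business logic": (9, 0),
    "Deprecation awareness": (9, 1),
    "Code structure": (8, 9),
    "Complexity hotspots": (7, 9),
    "Architectural health": (5, 9),
    "Test coverage map": (6, 8),
    "Execution flow tracing": (0, 8),
}

max_total = 10 * len(scores)
explorer_total = sum(e for e, _ in scores.values())
graph_total = sum(g for _, g in scores.values())

# Dimensions where a method produced nothing at all.
explorer_zeros = [dim for dim, (e, _) in scores.items() if e == 0]
graph_zeros = [dim for dim, (_, g) in scores.items() if g == 0]

print(f"Explorer {explorer_total}/{max_total}, Graph {graph_total}/{max_total}")
print("Explorer scored 0 on:", explorer_zeros)
print("Graph scored 0 on:", graph_zeros)
```

The graph has six zeros and the explorer has one, and the two sets don't overlap.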
What Only the Graph Found
These are things no amount of file reading would surface:
- One frontend had 77 architectural coupling warnings with a coupling ratio of 8.8 — by far the most structurally unhealthy repo. The explorer called it “well-organized.”
- A React admin UI had a component with criticality score 0.92 — the single most critical execution flow across all 9 repos. The explorer described it as just another feature.
- A video call component was 4,295 lines — an extreme complexity hotspot that nobody had flagged.
- The Go monorepo had a coupling ratio of 0.04 — quantitative proof that its DDD domain architecture was actually working. The explorer described the architecture qualitatively, but couldn’t put a number on it.
- Two deprecated PHP repos had exactly 0 cross-community edges — they’re structurally dead. Useful data for a decommission decision.
- 131 execution flows across 8 repos with depth and criticality scores — entirely invisible to file-reading agents.
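To give a feel for what a coupling ratio measures, here is a toy version. This is an assumption, not the tool's documented formula: the sketch defines the ratio as cross-community edges divided by intra-community edges, and the function name, edge list, and community map are all hypothetical.

```python
def coupling_ratio(edges, community_of):
    """Cross-community edges divided by intra-community edges.

    ASSUMED definition for illustration only; code-review-graph may compute
    its coupling ratio differently.
    edges: iterable of (source, target) node ids
    community_of: dict mapping node id -> community id
    """
    edges = list(edges)
    cross = sum(1 for src, dst in edges if community_of[src] != community_of[dst])
    intra = len(edges) - cross
    if intra == 0:
        # No internal edges: undefined ratio, reported as inf (or 0 if no edges).
        return float("inf") if cross else 0.0
    return cross / intra


# Toy graph: a, b, c form one community; d sits in another.
community_of = {"a": 0, "b": 0, "c": 0, "d": 1}
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(coupling_ratio(edges, community_of))  # 1 cross edge over 3 intra edges
```

Under this definition, a value well below 1 means most calls stay inside their own community, and a value well above 1 means most calls cross community boundaries.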
What Only the Explorer Found
These are things a structural graph will never capture:
- The Go monorepo has 20 named DDD domains, each with specific purpose, database assignments, and Kafka roles
- One service is mid-migration — storage moving from monolith to a new microservice, gated by a feature flag
- A frontend calls 8 distinct backend services via Server Actions
- The legacy monolith runs 5 web servers on ports 80-84 serving different subdomains
- A PHP API has 6 sub-applications each on a different port
- One admin UI bootstraps auth from a PHP iframe session — tokens for 3 APIs
- A micro-frontend platform has dual auth systems — modern OAuth for new routes, legacy session cookies for iframe routes
- 12 database schemas across services, plus DynamoDB and Snowflake
- 12 SNS topics and 10 Kafka consumer topics with local table targets
- A complete cross-repo platform architecture diagram — which services talk to which, through what protocols
Where the Graph Broke Down
- PHP support is broken. The PHP monolith had 52,000 nodes parsed, but detected flows were all in vendored TinyMCE JavaScript. Zero useful PHP application flows. Tree-sitter’s PHP CALLS edge extraction produced near-zero results (22 edges in one PHP repo, 0 in another).
- No config/infrastructure awareness. Blind to Helm charts, Docker configs, CI/CD pipelines, environment variables. The graph lives entirely in source code ASTs.
- No semantic understanding. It knows jwt is a function. It doesn’t know that function validates Okta tokens via OIDC.
- Massive output sizes. Architecture overviews for larger repos were 1-9 MB of JSON, too large for an LLM context window. I had to delegate to a summarization agent just to make them usable.
- No cross-repo awareness. Each repo is an isolated graph. Can’t trace a call from the backend to the frontend.
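For the output-size problem, a cheap guard before anything reaches the model can help. The sketch below is not the summarization agent I used; it is a plain structural trim. It assumes a rough 4-characters-per-token heuristic (not a real tokenizer), and the function names and the 8,000-token budget are made up for illustration.

```python
import json

CHARS_PER_TOKEN = 4  # rough rule of thumb, an assumption, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN


def outline(value, depth: int = 2):
    """Keep the top `depth` levels of a JSON value; replace the rest with size markers."""
    if isinstance(value, dict):
        if depth == 0:
            return f"<dict with {len(value)} keys>"
        return {key: outline(item, depth - 1) for key, item in value.items()}
    if isinstance(value, list):
        if depth == 0:
            return f"<list of {len(value)} items>"
        # Keep the first element as a shape sample, summarize the remainder.
        head = [outline(item, depth - 1) for item in value[:1]]
        if len(value) > 1:
            head.append(f"<{len(value) - 1} more items>")
        return head
    return value


def fit_to_budget(payload: str, max_tokens: int = 8000) -> str:
    """Return the payload unchanged if it fits, otherwise a structural outline."""
    if estimate_tokens(payload) <= max_tokens:
        return payload
    return json.dumps(outline(json.loads(payload)), indent=2)
```

An outline like this tells the model which keys exist and how big they are, so it can ask for one slice at a time instead of swallowing megabytes of JSON.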
When to Use Which
| Scenario | Best Method | Why |
|---|---|---|
| “What does this repo do?” | Explorer | Business logic, APIs, features, purpose |
| “How healthy is the architecture?” | Graph | Coupling ratios, warnings, community boundaries |
| “What tech stack?” | Explorer | Versions, frameworks, packages |
| “Where are the complexity hotspots?” | Graph | Line counts, criticality scores, flow depths |
| “What databases/services does it use?” | Explorer | Config parsing, connection strings, env vars |
| “Which execution paths are most critical?” | Graph | Flow criticality scoring |
| “Is this change safe?” (blast radius) | Graph | Impact radius, affected flows, affected communities |
| “Onboard a new engineer” | Explorer | Full context — what, why, how, where |
| “Prioritize refactoring targets” | Graph | Coupling ratios + warning counts = data-driven priorities |
| “Full architecture review” | Both | Graph for structure + Explorer for semantics |
The Optimal Strategy
Neither method alone gives you the full picture. The best approach I found:
Step 1: Graph first (30s, ~30K tokens, ~$0.18)
→ Structural baseline: community map, coupling ratios, flows, hotspots
→ Identify which repos need deep exploration and where to focus
Step 2: Targeted exploration (5m, ~933K tokens, ~$3.80)
→ Pre-load graph findings as context for each explorer agent
→ "This repo has 77 coupling warnings — find what's causing them"
→ Skip repos with 0 cross-community edges (confirmed dead)
Step 3: Synthesize
→ Merge structural precision with semantic richness
The graph tells you WHERE to look. Exploration tells you WHAT’s there. Together they’re stronger than either alone.
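The hand-off from Step 1 to Step 2 can be sketched as a small planning function. Everything here is hypothetical scaffolding: the GraphFindings fields, the prompt wording, and the sample values are placeholders, not output from the tool. The only rules taken from the steps above are "pre-load graph findings as context" and "skip repos with 0 cross-community edges".

```python
from dataclasses import dataclass


@dataclass
class GraphFindings:
    """Hypothetical per-repo summary extracted from the graph step."""
    repo: str
    coupling_warnings: int
    coupling_ratio: float
    cross_community_edges: int


def plan_exploration(findings: list[GraphFindings]) -> dict[str, str]:
    """Build one explorer prompt per repo worth exploring, worst coupling first."""
    prompts: dict[str, str] = {}
    ranked = sorted(findings, key=lambda f: f.coupling_warnings, reverse=True)
    for f in ranked:
        if f.cross_community_edges == 0:
            continue  # Step 2 rule: skip repos the graph shows as structurally dead
        prompts[f.repo] = (
            f"Graph baseline for {f.repo}: {f.coupling_warnings} coupling warnings, "
            f"coupling ratio {f.coupling_ratio}. Find what is causing them, then "
            "document APIs, schemas, messaging, and auth flows."
        )
    return prompts


# Placeholder values, for illustration only.
sample = [
    GraphFindings("repo-a", coupling_warnings=77, coupling_ratio=8.8, cross_community_edges=120),
    GraphFindings("repo-b", coupling_warnings=3, coupling_ratio=0.04, cross_community_edges=40),
    GraphFindings("repo-c", coupling_warnings=0, coupling_ratio=0.0, cross_community_edges=0),
]
for repo, prompt in plan_exploration(sample).items():
    print(repo, "->", prompt)
```

Sorting by warning count means the explorer budget goes to the least healthy repos first, which is the whole point of running the graph before the agents.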
My Verdict
code-review-graph is a precision tool for structural analysis — not a general-purpose context replacement. It genuinely shines at architectural health metrics, complexity detection, and blast radius analysis. These are things traditional file-reading agents literally cannot do.
But it scored 35% vs 79% on a comprehensive comparison. It’s blind to APIs, databases, messaging, infrastructure, auth flows, and business logic — which is most of what matters for daily development work.
I’m keeping it on my machine for code reviews and refactoring prioritization. The blast radius and coupling ratio features alone justify the setup. But I wouldn’t pitch it to a team as a productivity game-changer — it’s a specialized tool with a specific sweet spot.
Setup tip: If you’re using any platform other than Claude Code (or even Claude Code via CLI), you’ll hit the post-processing bug. After code-review-graph build, call build_or_update_graph_tool(full_rebuild=True) via the MCP server, or add the post-processing steps manually. Otherwise you’ll get empty results for the most interesting features.