The Right Way to Feed Your Codebase to an LLM

There's been a lot of hype about code-review-graph, an open-source MCP server that parses your codebase with Tree-sitter, builds a structural knowledge graph in local SQLite, and exposes 22 tools for AI coding agents. The pitch: point your LLM at exactly the files it needs instead of letting it guess across thousands. The claimed token savings: 6-49x.

I wanted to know if the hype matched reality. So I tested it across 9 production repos using an agentic OpenCode workflow, and ran a head-to-head comparison against traditional AI code exploration on 14 dimensions. The results surprised me.

The Setup

Repos tested: 9 repositories spanning a real multi-repo platform — a mix of:

Repo | Language | Size | Description
Go monorepo | Go | 6,734 files | Primary backend: DDD architecture, 20+ domains
Next.js frontend | TypeScript | 733 files | Customer-facing portal
Rails service | Ruby | ~500 files | Bid management + Kafka consumers
PHP monolith | PHP | 8,394 files | Legacy system (deprecating)
Video/voice app | TypeScript | 355 files | Real-time communication (Turborepo)
PHP REST API | PHP | 885 files | Product catalog API (Silex)
PHP admin API | PHP | 228 files | Admin backend (Laravel Lumen)
React admin UI | TypeScript | 454 files | Admin frontend (React 16, Redux-Saga)
Micro-frontend platform | TypeScript | 465 files | 5 micro-frontends (Turborepo, Next.js)

After building graphs with post-processing: 148,000 nodes, 670,000 edges, 13,786 communities, 131 execution flows.

The comparison: I ran two approaches against all 9 repos:

  1. Explorer method: 9 parallel AI agents reading source files, configs, READMEs, routes, and schemas — the way most AI coding tools work today
  2. Graph method: Querying the code-review-graph MCP server for structural data — communities, flows, architecture health, blast radius

The Bug I Found Before I Could Even Compare

First attempt: I built all graphs via CLI (code-review-graph build), started the MCP server, and queried. Communities: 0. Flows: 0. FTS search: 0. Everything empty.

Turns out, the CLI build command and the MCP build_or_update_graph_tool run completely different pipelines:

Post-processing step | CLI build | MCP tool
Tree-sitter parse (nodes + edges) | Yes | Yes
Compute signatures | No | Yes
Rebuild FTS index | No | Yes
Trace execution flows | No | Yes
Detect communities | No | Yes

The CLI only parses files. The MCP tool parses files AND runs 4 post-processing steps that populate the communities, flows, and search tables. Without those steps, most of the interesting tools return nothing.

This affects the CLI build, update, watch commands, and even the bundled Claude Code hooks (which call CLI update). I wrote a workaround — a build script that calls the post-processing via Python after each CLI build — and filed an issue.
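A quick way to catch this failure mode is to count rows in the graph database after a build. The sketch below is not the workaround script described above; it only detects the symptom. It reads table names from SQLite itself rather than assuming the tool's schema, and the database path you pass to sqlite3.connect is an assumption you would need to fill in for your own setup.

```python
import sqlite3


def table_row_counts(conn):
    """Return {table_name: row_count} for every user table in the database.

    Table names come from sqlite_master, so nothing about the tool's
    schema is assumed. A count of None means the table could not be read
    (for example, a virtual table whose module is not compiled in).
    """
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        quoted = '"' + name.replace('"', '""') + '"'
        try:
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        except sqlite3.Error:
            counts[name] = None
    return counts


def empty_tables(conn):
    """Names of tables with zero rows: candidates for skipped post-processing."""
    return [name for name, n in table_row_counts(conn).items() if n == 0]
```

To use it, open the graph's SQLite file with sqlite3.connect(path) and print empty_tables(conn). If tables related to communities, flows, or search come back empty after a CLI build, the post-processing steps have not run.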

The Head-to-Head Results

Cost & Speed

Metric | Explorer | Graph | Winner
Total tokens | 1,350,000 | 152,000 | Graph (8.9x fewer)
API cost | $11.60 | $7.18 | Graph (1.6x cheaper)
Wall clock time | 5m 45s | 3m | Graph (1.9x faster)
Avg tokens per repo | ~104K | ~17K | Graph (6x fewer)

Graph is significantly cheaper in tokens. But cheaper doesn't mean better — it depends on what you get back.
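The ratios in the table can be rechecked from the raw totals. The sketch below uses only the figures reported above; the helper names are illustrative. It also derives cost per million tokens, which shows why an 8.9x token reduction became only a 1.6x cost reduction: the graph run paid more per token. The source figures don't say why.

```python
# Totals copied from the Cost & Speed table; times converted to seconds.
metrics = {
    "total tokens": (1_350_000, 152_000),
    "API cost (USD)": (11.60, 7.18),
    "wall clock (s)": (5 * 60 + 45, 3 * 60),
}


def ratio(explorer, graph):
    """How many times larger the explorer figure is, to one decimal."""
    return round(explorer / graph, 1)


def cost_per_million_tokens(cost_usd, tokens):
    """Effective price paid per million tokens."""
    return cost_usd / tokens * 1_000_000


for name, (explorer, graph) in metrics.items():
    print(f"{name}: explorer/graph = {ratio(explorer, graph)}x")

print(f"explorer: ${cost_per_million_tokens(11.60, 1_350_000):.2f} per 1M tokens")
print(f"graph:    ${cost_per_million_tokens(7.18, 152_000):.2f} per 1M tokens")
```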

The 14-Dimension Scorecard

I scored both methods across 14 dimensions on a 0-10 scale:

Dimension | Explorer | Graph | Gap
Tech stack & versions | 10 | 2 | Explorer +8
API surface | 10 | 0 | Explorer +10
Database schemas | 9 | 0 | Explorer +9
Messaging (Kafka/SNS) | 9 | 0 | Explorer +9
External dependencies | 10 | 0 | Explorer +10
Infrastructure | 9 | 0 | Explorer +9
Auth flows | 9 | 3 | Explorer +6
Business logic | 9 | 0 | Explorer +9
Deprecation awareness | 9 | 1 | Explorer +8
Code structure | 8 | 9 | Graph +1
Complexity hotspots | 7 | 9 | Graph +2
Architectural health | 5 | 9 | Graph +4
Test coverage map | 6 | 8 | Graph +2
Execution flow tracing | 0 | 8 | Graph +8
TOTAL | 110/140 (78.6%) | 49/140 (35.0%) |

Explorer wins 78.6% to 35.0%. Not close.

But that headline number misses the point. Look at where each method scored 0.
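For anyone who wants to recheck the totals or see the zero-score dimensions at a glance, here is a small sketch. It uses the scorecard numbers exactly as reported; the variable names are only for illustration.

```python
# (explorer, graph) scores per dimension, copied from the scorecard.
scores = {
    "Tech stack & versions": (10, 2),
    "API surface": (10, 0),
    "Database schemas": (9, 0),
    "Messaging (Kafka/SNS)": (9, 0),
    "External dependencies": (10, 0),
    "Infrastructure": (9, 0),
    "Auth flows": (9, 3),
    "Business logic": (9, 0),
    "Deprecation awareness": (9, 1),
    "Code structure": (8, 9),
    "Complexity hotspots": (7, 9),
    "Architectural health": (5, 9),
    "Test coverage map": (6, 8),
    "Execution flow tracing": (0, 8),
}

max_total = 10 * len(scores)
explorer_total = sum(e for e, _ in scores.values())
graph_total = sum(g for _, g in scores.values())

# Dimensions where a method contributed nothing at all.
explorer_zeros = [d for d, (e, _) in scores.items() if e == 0]
graph_zeros = [d for d, (_, g) in scores.items() if g == 0]

print(f"Explorer: {explorer_total}/{max_total} ({explorer_total / max_total:.1%})")
print(f"Graph:    {graph_total}/{max_total} ({graph_total / max_total:.1%})")
print("Explorer scored 0 on:", explorer_zeros)
print("Graph scored 0 on:", graph_zeros)
```

The explorer has one blind spot (execution flow tracing); the graph has six.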

What Only the Graph Found

These are things the file-reading agents never surfaced:

  • One frontend had 77 architectural coupling warnings with a coupling ratio of 8.8 — by far the most structurally unhealthy repo. The explorer called it “well-organized.”
  • A React admin UI had a component with criticality score 0.92 — the single most critical execution flow across all 9 repos. The explorer described it as just another feature.
  • A video call component was 4,295 lines — an extreme complexity hotspot that nobody had flagged.
  • The Go monorepo had a coupling ratio of 0.04 — quantitative proof that its DDD domain architecture was actually working. The explorer described the architecture qualitatively, but couldn’t put a number on it.
  • Two deprecated PHP repos had exactly 0 cross-community edges — they’re structurally dead. Useful data for a decommission decision.
  • 131 execution flows across 8 repos with depth and criticality scores — entirely invisible to file-reading agents.
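To give the coupling ratios above some intuition, here is one plausible way such a number could be computed. This is an assumption for illustration only: the definition below (cross-community edges divided by within-community edges) is not taken from code-review-graph, and the tool may calculate its ratio differently.

```python
def coupling_ratio(edges, community_of):
    """Cross-community edges divided by within-community edges.

    ASSUMPTION: an illustrative definition, not the tool's actual formula.
    `edges` is an iterable of (source, target) node ids and `community_of`
    maps each node id to its community id.
    """
    cross = 0
    within = 0
    for src, dst in edges:
        if community_of[src] == community_of[dst]:
            within += 1
        else:
            cross += 1
    if within == 0:
        # No internal edges at all: only cross edges (if any) exist.
        return float("inf") if cross else 0.0
    return cross / within
```

Under this definition, a ratio far below 1 means most calls stay inside their own community, a ratio well above 1 means most calls cross community boundaries, and a repo with no cross-community edges scores exactly 0.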

What Only the Explorer Found

These are things a structural graph will never capture:

  • The Go monorepo has 20 named DDD domains, each with specific purpose, database assignments, and Kafka roles
  • One service is mid-migration — storage moving from monolith to a new microservice, gated by a feature flag
  • A frontend calls 8 distinct backend services via Server Actions
  • The legacy monolith runs 5 web servers on ports 80-84 serving different subdomains
  • A PHP API has 6 sub-applications each on a different port
  • One admin UI bootstraps auth from a PHP iframe session — tokens for 3 APIs
  • A micro-frontend platform has dual auth systems — modern OAuth for new routes, legacy session cookies for iframe routes
  • 12 database schemas across services, plus DynamoDB and Snowflake
  • 12 SNS topics and 10 Kafka consumer topics with local table targets
  • A complete cross-repo platform architecture diagram — which services talk to which, through what protocols

Where the Graph Broke Down

  • PHP support is broken. The PHP monolith had 52,000 nodes parsed, but detected flows were all in vendored TinyMCE JavaScript. Zero useful PHP application flows. Tree-sitter’s PHP CALLS edge extraction produced near-zero results (22 edges in one PHP repo, 0 in another).
  • No config/infrastructure awareness. Blind to Helm charts, Docker configs, CI/CD pipelines, environment variables. The graph lives entirely in source code ASTs.
  • No semantic understanding. It knows jwt is a function. It doesn’t know that function validates Okta tokens via OIDC.
  • Massive output sizes. Architecture overviews for larger repos were 1-9 MB of JSON — too large for an LLM context window. I had to delegate to a summarization agent just to make them usable.
  • No cross-repo awareness. Each repo is an isolated graph. Can’t trace a call from the backend to the frontend.
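For the output-size problem, a cruder alternative to the summarization agent described above is to cap every list in the JSON before it reaches the model. The sketch below is generic: it makes no assumption about the tool's actual output schema, and the field names in the usage note are hypothetical.

```python
import json


def truncate_lists(obj, max_items):
    """Recursively keep only the first `max_items` entries of every list.

    A marker string is appended to each shortened list so the reader
    (human or LLM) can tell that items were dropped.
    """
    if isinstance(obj, dict):
        return {k: truncate_lists(v, max_items) for k, v in obj.items()}
    if isinstance(obj, list):
        kept = [truncate_lists(v, max_items) for v in obj[:max_items]]
        omitted = len(obj) - len(kept)
        if omitted > 0:
            kept.append(f"... {omitted} more items omitted")
        return kept
    return obj


def json_size_bytes(obj):
    """Size of the object when serialized as UTF-8 JSON."""
    return len(json.dumps(obj).encode("utf-8"))
```

For example, truncate_lists(overview, 20) on a parsed overview with a hypothetical "communities" list would keep the first 20 communities and, inside each one, the first 20 entries of any nested list. This loses information a summarizer might have kept, so it suits a first pass better than a final report.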

When to Use Which

Scenario | Best Method | Why
“What does this repo do?” | Explorer | Business logic, APIs, features, purpose
“How healthy is the architecture?” | Graph | Coupling ratios, warnings, community boundaries
“What tech stack?” | Explorer | Versions, frameworks, packages
“Where are the complexity hotspots?” | Graph | Line counts, criticality scores, flow depths
“What databases/services does it use?” | Explorer | Config parsing, connection strings, env vars
“Which execution paths are most critical?” | Graph | Flow criticality scoring
“Is this change safe?” (blast radius) | Graph | Impact radius, affected flows, affected communities
“Onboard a new engineer” | Explorer | Full context: what, why, how, where
“Prioritize refactoring targets” | Graph | Coupling ratios + warning counts = data-driven priorities
“Full architecture review” | Both | Graph for structure + Explorer for semantics

The Optimal Strategy

Neither method alone gives you the full picture. The best approach I found:

Step 1: Graph first (30s, ~30K tokens, ~$0.18)
  → Structural baseline: community map, coupling ratios, flows, hotspots
  → Identify which repos need deep exploration and where to focus

Step 2: Targeted exploration (5m, ~933K tokens, ~$3.80)
  → Pre-load graph findings as context for each explorer agent
  → "This repo has 77 coupling warnings — find what's causing them"
  → Skip repos with 0 cross-community edges (confirmed dead)

Step 3: Synthesize
  → Merge structural precision with semantic richness

The graph tells you WHERE to look. Exploration tells you WHAT’s there. Together they’re stronger than either alone.
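The hand-off from step 1 to step 2 can be sketched in a few lines. Everything here is hypothetical scaffolding: the RepoGraphSummary fields and the prompt wording are assumptions for illustration, not output from code-review-graph or any agent framework.

```python
from dataclasses import dataclass


@dataclass
class RepoGraphSummary:
    """Hypothetical per-repo summary extracted from the graph in step 1."""
    name: str
    coupling_warnings: int
    cross_community_edges: int


def plan_exploration(summaries):
    """Turn graph findings into explorer prompts.

    Returns (prompts, skipped): repos with zero cross-community edges are
    skipped, and the rest are ordered by coupling warnings, worst first.
    """
    skipped = [s.name for s in summaries if s.cross_community_edges == 0]
    active = [s for s in summaries if s.cross_community_edges > 0]
    active.sort(key=lambda s: s.coupling_warnings, reverse=True)
    prompts = {
        s.name: (
            f"This repo has {s.coupling_warnings} coupling warnings. "
            "Find what's causing them."
        )
        for s in active
    }
    return prompts, skipped
```

Ordering by warning count means that if the exploration budget runs out, the structurally worst repos have already been covered.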

My Verdict

code-review-graph is a precision tool for structural analysis — not a general-purpose context replacement. It genuinely shines at architectural health metrics, complexity detection, and blast radius analysis. These are things traditional file-reading agents literally cannot do.

But it scored 35% against the explorer’s 79% across the 14 dimensions. It’s blind to APIs, databases, messaging, infrastructure, and business logic, and nearly blind to auth flows. That covers most of what matters for daily development work.

I’m keeping it on my machine for code reviews and refactoring prioritization. The blast radius and coupling ratio features alone justify the setup. But I wouldn’t pitch it to a team as a productivity game-changer — it’s a specialized tool with a specific sweet spot.

Setup tip: Any workflow that builds or updates the graph through the CLI will hit the post-processing bug, and that includes the bundled Claude Code hooks. After code-review-graph build, call build_or_update_graph_tool(full_rebuild=True) via the MCP server, or add the post-processing steps manually. Otherwise you’ll get empty results for the most interesting features.