Graph Salience + Agentic Context
Behavioral analysis finds code that changes. Structural analysis finds code that matters. Combining them gives agents something neither provides alone.
Sing in me, Muse, and through me tell the story...
Understanding a code base is crucial for an agent to operate on it appropriately, but most systems struggle to provide that understanding. A utility module half the codebase depends on but hasn't been touched in a year is invisible to behavioral analysis. It's arguably the most important thing for an agent to understand, because the blast radius of a bad change is enormous. Homer combines structural graph analysis (PageRank, betweenness, HITS on scope-graph-derived call graphs) with behavioral signals (churn, bus factor, co-change patterns) into composite salience scores, adds meaningful classification, and tracks how those scores drift over time. MCP tools let agents query risk, co-changes, and conventions before modifying code.
The Invisible Code
I've been watching agents break the same kind of code for months. Not the code that's changing. Not the code with known bugs. The code that hasn't been touched in a year, that sits at the center of the dependency graph, that sixteen modules import without thinking about it. An agent rewrites a utility function in that file because the type signature looks improvable. Tests pass locally. Then the downstream failures start cascading.
The standard fix most people reach for is more context: bigger windows, better retrieval, AGENTS.md files that explain what matters. The AGENTS.md ecosystem has real traction now (60k+ projects, measurable improvements in agent performance), and I take it seriously enough that Homer generates them. Homer's AGENTS.md files give a structured view of the key analysis: architecture from Louvain clusters, load-bearing code ranked by salience, danger zones (high churn + low bus factor), detected conventions, areas that confuse agents (correction hotspots from fix-after-AI patterns), and domain vocabulary. But the generated file is a snapshot; the MCP tools are a live query interface, and the tools are the more interesting part.
But AGENTS.md tells agents what code does. It doesn't tell them what code matters.
Adam Tornhill's work is the strongest prior art. His insight that version control history encodes technical debt, hotspots, and organizational risk is genuinely valuable, and CodeScene operationalized it for human teams. His "Code for Machines, Not Just Humans" paper found that AI-assisted changes increase defect risk by roughly 30% on code with existing health problems. That 30% number should bother you: it means agents are disproportionately likely to make things worse in exactly the places where things are already bad. The implication is that agents need to know what code is fragile before they touch it, not after CI turns red.
But behavioral analysis finds what changes. It doesn't find what matters structurally. Consider a module imported by 200 files across your codebase, untouched for 14 months, written by someone who left the company. Churn analysis sees nothing. Recency analysis sees nothing. Bus factor might flag it, but only as one signal among many.
Structurally, that module is foundational stable code. The dependency graph knows it's load-bearing even when the git log has forgotten it exists. An agent that modifies this file without understanding its structural position is playing a game it doesn't know it's playing.
Composite Salience
Instead of asking "what changed recently?" we need to ask "what matters structurally, and how does that interact with what's changing?"
Salience in Homer is a weighted composite of structural and behavioral signals. Three graph centrality measures, four behavioral measures from git history. The structural side captures different aspects of graph importance:
PageRank (30% weight) measures transitive importance on the call graph. A file imported by three modules that are each imported by dozens more scores higher than a file imported directly by dozens of leaf modules. Importance propagates. Same intuition as the original web ranking, applied to petgraph::algo::page_rank over the projected dependency graph.
Betweenness centrality (15%) via Brandes' algorithm finds bridge code: files on the shortest paths between otherwise disconnected subsystems. High betweenness means "break this and you sever the connection between parts of the codebase that communicate through it." Utility modules, shared interfaces, adapter layers.
HITS authority (15%) distinguishes hubs from authorities. A hub imports many things. An authority is imported by many things. The distinction matters for risk: modifying an authority (a widely-used interface) has different blast radius characteristics than modifying a hub (an orchestration module).
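The intuition behind the PageRank component can be shown with a toy power-iteration sketch. This is illustrative only (Homer runs petgraph's implementation over the real projected graph); the edge list and node names are hypothetical, with edges written as (importer, imported) so that importance flows toward the imported module.

```python
# Toy power-iteration PageRank over a tiny import graph.
# Edges are (importer, imported): rank flows toward what is depended on.
def pagerank(edges, damping=0.85, iters=100):
    nodes = sorted({n for e in edges for n in e})
    out = {n: [b for a, b in edges if a == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # dangling nodes spread mass evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# "util" is imported by everything, so it should rank highest.
edges = [("a", "util"), ("b", "util"), ("c", "util"), ("a", "b")]
ranks = pagerank(edges)
```

Run on this four-node graph, `util` ends up with the highest rank even though no single importer is itself important, which is the transitive-importance property the composite relies on.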
The behavioral side captures what the graph can't:
Change frequency (15%) is percentile-normalized. Raw commit counts are misleading (a file touched once in every PR isn't the same as a file touched once per quarter). Percentile ranking against the full repo normalizes across different project cadences.
Bus factor risk (10%) is the inverse of contributor count. A file with one contributor has bus factor risk of 1.0. A file with five contributors has risk of 0.2. Simple, but effective at surfacing knowledge concentration.
Code size (5%) is normalized file size. Larger files carry more surface area for breakage. This is the weakest signal, deliberately down-weighted.
Test presence (10%) is binary: does this file have associated tests? Untested code with high structural centrality is a specific kind of danger.
All inputs are normalized to [0, 1]. The composite:
```
salience = pagerank         × 0.30
         + betweenness      × 0.15
         + hits_authority   × 0.15
         + churn_percentile × 0.15
         + (1 / bus_factor) × 0.10
         + size_normalized  × 0.05
         + has_tests        × 0.10
```
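The composite is straightforward to sketch as a plain function over already-normalized inputs, using the default weights above. This is a sketch of the formula, not Homer's actual code; the parameter names are mine, and `bus_factor` here is the contributor count, so its inverse is the risk term from the behavioral side.

```python
# Composite salience from normalized [0, 1] inputs, default weights.
# An illustrative sketch; Homer's real implementation may differ in detail.
def salience(pagerank, betweenness, hits_authority,
             churn_percentile, bus_factor, size_normalized, has_tests):
    return (pagerank * 0.30
            + betweenness * 0.15
            + hits_authority * 0.15
            + churn_percentile * 0.15
            + (1.0 / bus_factor) * 0.10   # bus_factor = contributor count
            + size_normalized * 0.05
            + (1.0 if has_tests else 0.0) * 0.10)

# A structurally central, single-owner, rarely-changed, untested file:
s = salience(0.9, 0.8, 0.7, 0.2, 1, 0.5, False)
```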
The salience score gates everything downstream. On a repo with 10,000 functions, maybe 200 clear the threshold for expensive analysis (LLM-powered summarization, semantic clustering). That's the most important performance decision in the system: use cheap structural analysis to filter, then spend LLM budget only on what the graph says matters.weightsThe weights are hand-tuned defaults, fully configurable per project. I'm not going to pretend they're validated. They come from testing on my own repos and repos of people who've given me feedback. Learned weights from agent outcome data would be real validation.
The composite maps every file into a four-quadrant classification:
| | High Churn | Low Churn |
|---|---|---|
| High Centrality | ActiveHotspot | FoundationalStable |
| Low Centrality | PeripheralActive | QuietLeaf |
ActiveHotspot is your highest-risk quadrant: structurally central and in flux. PeripheralActive is usually safe: feature code, tests, scripts changing at the edges. QuietLeaf is low risk, low attention.

FoundationalStable is the quadrant that should worry you most. Load-bearing, under-documented, often owned by people who've left. Behavioral analysis can't see it because nothing is happening. Structural analysis sees it clearly because the dependency graph doesn't forget. When an agent finally touches one of these files, the blast radius is large and the person who understood the invariants is gone.
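The quadrant mapping from the table reduces to two threshold comparisons. A minimal sketch, assuming normalized scores and a hypothetical 0.5 cutoff (Homer's actual thresholds are its own and presumably configurable):

```python
# Quadrant classification per the table above. The 0.5 cutoff is an
# illustrative assumption, not Homer's documented threshold.
def quadrant(centrality, churn, cut=0.5):
    if centrality >= cut:
        return "ActiveHotspot" if churn >= cut else "FoundationalStable"
    return "PeripheralActive" if churn >= cut else "QuietLeaf"
```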
Precise Code Analysis
Homer has two tiers of code analysis per language.
The heuristic tier uses tree-sitter queries to extract function definitions, calls, and imports. It works within a single file. Every language gets this as a baseline, and it's fast.
The precise tier builds actual scope graphs. This is where the real work happens. For each file, Homer constructs a FileScopeGraph with four node types: PushSymbol (a reference to a name), PopSymbol (a definition that binds a name), Scope (a boundary like a function, block, or module), and ExportScope / ImportScope (boundary nodes for cross-file resolution).

The intellectual lineage runs from Visser's "A Theory of Name Resolution" through GitHub's stack graphs project (now archived). Visser formalized how programming languages resolve names through nested scopes; GitHub built stack graphs on that theory for code navigation. Homer uses tree-sitter directly, but the scope graph model is the same: definitions, references, scopes, and the edges between them. Path-stitching across scope boundaries resolves references to their definitions.
The scope graph for a single file captures which names are defined, which names are referenced, and which scopes contain which other scopes. The interesting part is cross-file resolution: Homer stitches scope graphs together through import/export boundary nodes. When module_a.rs imports calculate from module_b.rs, the PushSymbol in A connects through the import/export boundary to the PopSymbol in B. The reference resolves to its definition across file boundaries.
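A drastically simplified sketch of that cross-file resolution: each file exposes its PopSymbol definitions through an export boundary, and a PushSymbol reference resolves locally first, then through the file's imports. The real mechanism is path-stitching over graph nodes; the dict-based model, function names, and file names below are all illustrative.

```python
# Toy cross-file name resolution in the spirit of the scope-graph model.
# All names and structure here are illustrative, not Homer's internals.
defs = {                       # PopSymbols per file (exposed via ExportScope)
    "module_b.rs": {"calculate"},
    "module_a.rs": {"main"},
}
imports = {                    # ImportScope edges: file -> files it imports from
    "module_a.rs": ["module_b.rs"],
}

def resolve(file, name):
    """Resolve a PushSymbol `name` in `file` to the defining file, if any."""
    if name in defs.get(file, set()):
        return file                       # resolved within the same file
    for src in imports.get(file, []):
        if name in defs.get(src, set()):
            return src                    # resolved across the boundary
    return None
```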
From these resolved scope graphs, Homer projects two derived graphs. The call graph maps function-to-function relationships. The import graph maps file-to-file dependencies through resolved symbol references, not string matching on import paths. These projections feed into petgraph DiGraph structures where the centrality algorithms run.
Six languages have precise tier implementations: Rust, Python, TypeScript, JavaScript, Go, Java. Each has its own tree-sitter grammar and language-specific scope rules. The Rust extractor knows that mod declarations create scope boundaries and that pub(crate) restricts export scope. The Python extractor handles from x import y differently than import x. The TypeScript extractor understands export and import type as separate from value imports. These are language-specific scope graph builders with real awareness of each language's name resolution rules.
Temporal Analysis
Static analysis gives you a snapshot. The dependency graph right now. The salience scores right now. That's useful but it misses drift.
Homer tracks how salience scores change over time. Each analysis run can snapshot its results, and the temporal analyzer compares across snapshots. The mechanism is straightforward: for each file, Homer keeps the last 10 composite salience scores in a score_history array, then runs linear regression on the (index, score) pairs. The slope tells you the trend.

The classification thresholds are ±0.01 slope. Above 0.01 is "Increasing" (this file is becoming more structurally central over time). Below -0.01 is "Decreasing." In between is "Stable." These thresholds are, like the salience weights, hand-tuned. The temporal analyzer also keeps previous community assignments from Louvain clustering and computes Jaccard similarity against current assignments. When community boundaries shift, that's architectural drift.
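The regression-plus-thresholds mechanism described above fits in a few lines. A sketch with the stated ±0.01 cutoffs (the function name and short-history guard are my own):

```python
# Least-squares slope over a file's score_history, classified with
# the ±0.01 thresholds described above. An illustrative sketch.
def trend(score_history):
    n = len(score_history)
    if n < 2:
        return "Stable"  # not enough points for a slope
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(score_history) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, score_history))
    den = sum((x - mean_x) ** 2 for x in xs)
    slope = num / den
    if slope > 0.01:
        return "Increasing"
    if slope < -0.01:
        return "Decreasing"
    return "Stable"
```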
A file whose salience is increasing over time is accumulating structural importance. More things are depending on it. More call paths are routing through it. If that file is also in the FoundationalStable quadrant (high centrality, low churn), the temporal trend is a warning: this file is becoming more load-bearing while remaining less actively maintained. That's a risk trajectory that no single-snapshot analysis captures.
Architectural drift detection works similarly. Homer runs Louvain community detection (multi-level, with graph contraction) to find natural module boundaries in the dependency graph, then compares those boundaries against the previous run's communities via Jaccard similarity on membership sets. When the algorithmic communities diverge from the directory structure, or when community boundaries shift between runs, that's a signal that the codebase's organizational structure doesn't match its structural dependencies. The code is evolving in a direction the directory layout no longer reflects.

The Louvain implementation is multi-level: greedy local modularity moves, then graph contraction (merge all nodes in the same community into a super-node, summing edge weights), iterated up to 10 levels. It treats directed graphs as undirected for modularity computation. The output is community assignments that Homer checks against directory prefixes; when a community's members span three unrelated directories, that's a cross-cutting concern worth flagging.
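The Jaccard comparison on membership sets is the simple part, and worth seeing concretely. The community name and member files below are hypothetical:

```python
# Drift between two runs' community assignments, via Jaccard similarity
# on membership sets. Community and file names are hypothetical.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

prev = {"auth": {"login.rs", "session.rs", "token.rs"}}
curr = {"auth": {"login.rs", "session.rs", "middleware.rs"}}

similarity = jaccard(prev["auth"], curr["auth"])  # 2 shared / 4 total
drift = 1.0 - similarity
```

A similarity well below 1.0 between runs means the community's boundary moved, which is exactly the architectural-drift signal described above.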
Co-Change Detection
The behavioral analysis that I find most useful in practice is co-change detection. Not "what files have high churn" but "what files change together."
Homer uses a seed-and-grow algorithm. Start with file pairs that have strong co-occurrence in commits (the seeds). Compute Jaccard similarity: out of all commits that touch file A or file B, what fraction touch both? High Jaccard means strong coupling. Then grow: for each seed pair, look at what other files consistently appear in the same commits. Filter to groups with at least three members and 0.3 average confidence. The result is N-ary co-change groups, not just pairs, stored as CoChanges hyperedges with confidence scores derived from the Jaccard coefficient.
This is intentionally not frequent-itemset mining. The goal is actionable change sets ("when you touch A, you probably need to touch B and C"), not exhaustive pattern discovery.

The seed threshold (0.5 Jaccard) and growth threshold (0.3) are configurable. In practice, the seed threshold is about right, but the growth threshold sometimes pulls in files that co-change by coincidence rather than necessity. Tightening it to 0.35 helps on larger repos, where coincidental co-occurrence is more common.
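The seed phase of the algorithm is just pairwise Jaccard over commit sets. A sketch of that first phase with a hypothetical commit history (the grow step, which scans seed pairs' commits for additional members, is omitted):

```python
# Seed detection for co-change groups: Jaccard over the commit sets of
# each file pair, keeping pairs at or above the 0.5 seed threshold.
# Commit history and file names are hypothetical.
from itertools import combinations

commits = [
    {"auth.rs", "session.rs"},
    {"auth.rs", "session.rs", "middleware.rs"},
    {"auth.rs", "readme.md"},
    {"auth.rs", "session.rs"},
]

# Invert: file -> set of commit indices that touched it.
touched = {}
for i, files in enumerate(commits):
    for f in files:
        touched.setdefault(f, set()).add(i)

seeds = {}
for a, b in combinations(sorted(touched), 2):
    jac = len(touched[a] & touched[b]) / len(touched[a] | touched[b])
    if jac >= 0.5:
        seeds[(a, b)] = jac
```

Here auth.rs and session.rs share 3 of 4 commits (Jaccard 0.75) and survive as a seed; middleware.rs and readme.md fall below the threshold and would only join via the grow step, if at all.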
When an agent queries homer_co_changes for a file it's about to modify, it gets back the files that historically change alongside it, with confidence scores. "You're about to modify auth.rs. In 80% of past commits that touched auth.rs, session.rs and middleware.rs also changed." The agent knows to check those files. That's a concrete, actionable signal that no amount of static analysis provides.
The Hypergraph
The storage layer deserves a paragraph because the data model shapes what analyses are possible.
Homer stores everything in a SQLite-backed hypergraph with 15 node kinds (File, Function, Type, Module, Commit, Contributor, Release, PullRequest, Issue, Document, and more) and 18 hyperedge kinds (Calls, Imports, Inherits, Modifies, Authored, Reviewed, CoChanges, ClusterMembers, and more). The key table is hyperedge_members, which maps edges to nodes with roles and positions. One CoChanges edge connecting five files is five rows in that table, not ten pairwise edges in a traditional graph.

Hyperedges use a semantic identity_key for idempotent upserts: if you re-run analysis and the same co-change group is detected, the edge updates in place rather than duplicating. Content hashing on nodes handles incremental change detection, and SQLite runs in WAL mode for concurrent reads during MCP serving.
This is a more natural representation for code relationships. A function called by three modules is one hyperedge with three members, not three separate edges. A commit that modified six files is one Modifies edge with six members, not six separate edges. Community membership is one ClusterMembers edge per community, connecting all member files.
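A minimal sqlite3 sketch of this model: one CoChanges edge with N member rows, upserted idempotently by a semantic identity key. The schema, table names, and key format below are my invention for illustration, not Homer's actual schema.

```python
# Toy hyperedge store: one edge row, N member rows, idempotent upsert
# keyed on identity_key. Schema and naming are illustrative only.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE hyperedges (
        id INTEGER PRIMARY KEY,
        kind TEXT NOT NULL,
        identity_key TEXT NOT NULL UNIQUE,
        confidence REAL
    );
    CREATE TABLE hyperedge_members (
        edge_id INTEGER REFERENCES hyperedges(id),
        node_id TEXT NOT NULL,
        role TEXT,
        position INTEGER
    );
""")

def upsert_cochange(files, confidence):
    key = "cochange:" + ",".join(sorted(files))
    db.execute(
        "INSERT INTO hyperedges (kind, identity_key, confidence) VALUES (?, ?, ?) "
        "ON CONFLICT(identity_key) DO UPDATE SET confidence = excluded.confidence",
        ("CoChanges", key, confidence),
    )
    edge_id = db.execute(
        "SELECT id FROM hyperedges WHERE identity_key = ?", (key,)
    ).fetchone()[0]
    db.execute("DELETE FROM hyperedge_members WHERE edge_id = ?", (edge_id,))
    db.executemany(
        "INSERT INTO hyperedge_members (edge_id, node_id, role, position) "
        "VALUES (?, ?, 'member', ?)",
        [(edge_id, f, i) for i, f in enumerate(sorted(files))],
    )

upsert_cochange({"auth.rs", "session.rs", "middleware.rs"}, 0.8)
upsert_cochange({"auth.rs", "session.rs", "middleware.rs"}, 0.9)  # updates in place
```

Re-running the same detection leaves one edge with three member rows, which is the idempotency property the identity_key buys.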
A design principle that matters here: extraction and analysis errors on individual files don't abort the pipeline. A parse error in one file shouldn't prevent analysis of the other 10,000 files. Homer processes what it can, reports what it couldn't, and produces partial results. Real repositories are messy. The system has to be pragmatic about that.
The Agent Interface
Homer's output isn't a report. It's an MCP server with five tools that agents query at decision time.
homer_risk is the one I use most. It returns the composite salience score, quadrant classification, bus factor, change frequency, community membership, and a rolled-up risk level (low/medium/high/critical) for any file. The risk computation stacks: salience above 0.7 adds 3 points, bus factor of 1 adds 2, change frequency above 20 adds 2. Six or more points is critical. The agent gets a number and a reason.
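The point stacking described above is easy to sketch. The +3/+2/+2 increments and the six-point critical cutoff come from the text; the lower band boundaries for high/medium/low are my assumption, marked as such:

```python
# homer_risk-style point stacking: salience > 0.7 adds 3, bus factor of 1
# adds 2, change frequency > 20 adds 2; >= 6 points is critical.
# The high/medium band cutoffs (4 and 2) are assumptions, not documented.
def risk_level(salience, bus_factor, change_frequency):
    points = 0
    if salience > 0.7:
        points += 3
    if bus_factor == 1:
        points += 2
    if change_frequency > 20:
        points += 2
    if points >= 6:
        return "critical"
    if points >= 4:        # assumed cutoff
        return "high"
    if points >= 2:        # assumed cutoff
        return "medium"
    return "low"
```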
homer_co_changes returns the co-change groups for a file, filtered by confidence threshold. Default is 0.3. Returns the member files, co-occurrence count, support metric, and the N-ary hyperedge arity.
homer_conventions returns detected coding patterns: naming, testing, error handling, documentation, agent rules. Scoped by module or project-wide. This one is heuristic and I want to be honest about that. It's pattern-matching over the codebase, useful but imperfect.
homer_graph exposes raw centrality queries. Top files by PageRank, betweenness, HITS, or composite salience. Scope-filterable by path prefix. When you want to know "what are the most structurally important files in src/auth/?" this is the tool.
homer_query is entity lookup. Find a function or type by name, get its callers, callees, analysis data, and the history of changes to it.

Homer was designed MCP-first. The server uses rmcp with Tokio, and the five tools are the primary interface; AGENTS.md generation and CLI queries are secondary outputs from the same analysis pipeline.
One direction I'm exploring: mining agent interaction history itself. When developers use AI coding agents, their conversations contain a layer of knowledge that exists nowhere else. Task descriptions reveal what developers actually work on, in their own words. Corrections reveal where the codebase confuses agents. File references in prompts reveal what humans think is relevant context. This is essentially a record of the human-codebase interface, and it's extractable. Homer's prompt extractor (opt-in, disabled by default) mines Claude Code, Cursor, and similar agent session logs for these signals.
Installation
```shell
homer init .    # Extract and analyze the repository
homer update .  # Incremental update after changes
homer serve     # Start the MCP server
```
Add to your Claude Code MCP config:
```json
{
  "mcpServers": {
    "homer": {
      "command": "homer",
      "args": ["serve"]
    }
  }
}
```
Rough Edges
Cold start is slow. The initial homer init on a large repository takes minutes. Tree-sitter parsing, scope graph construction, cross-file resolution, centrality computation, community detection, co-change analysis. It's a lot of work. Incremental updates after that are fast (checkpoint-based, only reprocesses changed files), but the first run is a real barrier on large repos.
The composite weights are hand-tuned. They come from testing on my own repositories and the repos of people who've given me feedback. Learned weights from agent outcome data (did the agent break something after ignoring a high-salience file?) would be real validation, but that requires instrumenting agent outcomes in a way I haven't built yet. The temptation is to call these weights "empirically validated" because I've tested them across a dozen repos. Twelve is not a sample size; it's an anecdote collection.
Convention extraction is heuristic. Homer infers coding conventions by pattern-matching across the codebase. It might identify that the project uses Result<T, AppError> for error handling without understanding why that pattern exists or when it's appropriate to deviate.
Semantic analysis requires LLM access. Some deeper analysis (semantic clustering, convention summarization) calls out to language models. This costs money. The structural analysis (graph centrality, co-change detection, temporal trends) runs entirely locally and is free.
GitHub/GitLab extractors need tokens. The platform extractors for pull request history, review data, and issue linkage need API tokens. Core analysis works without them, but the behavioral signals are richer with platform data.
The Wider Field
Tornhill and CodeScene pioneered behavioral code analysis. Homer builds directly on that tradition. If you're using CodeScene for team-level insights, Homer is complementary: same behavioral foundations, different structural layer, different consumption model (MCP tools for agents rather than dashboards for humans).
Aider uses PageRank for its repository map, ranking files by graph importance to decide what context to include. Good instinct, and Homer's design was partly informed by watching how well that simple heuristic works. Homer takes PageRank as one component of a richer composite.
LocAgent (ACL 2025) applies graph-guided localization to help agents find relevant code for bug fixing. Same structural insight: the dependency graph contains information that flat file search misses. Different application: LocAgent localizes bugs, Homer assesses risk and provides context for general modification.
Augment's Code Context Engine reported 70%+ improvement from structured context retrieval. Their approach to understanding codebases is complementary. Homer provides the salience signals; context engines like Augment's determine how to retrieve and present the relevant code.
The gap: to my knowledge, no existing tool combines structural graph analysis (scope graphs, centrality, community detection) with behavioral signals (co-change, bus factor, churn, temporal trends) and exposes the result as live queryable tools for AI agents. The pieces exist separately. The integration, designed for agent consumption through MCP, is what's new.
What Remains
The problem isn't that agents lack context. It's that they lack structural context. They see files but not the graph. They see what changed recently but not what matters permanently. They can read documentation about what code does without knowing that this particular file is a betweenness bottleneck between three subsystems, has a bus factor of one, and has been gaining structural centrality for six months while nobody touches it.
The weights will get better. The temporal analysis will deepen. The cold start will get faster (there's a clear path through parallel extraction I haven't implemented yet). The agent interaction mining is early but I think it's the most interesting direction: the codebase itself can't tell you where agents get confused, but the agent session logs can.
Salience is structural, not just behavioral. Churn tells you what's active. The dependency graph tells you what's important. Temporal analysis tells you what's becoming important. Agents need all three, at query time, before they touch code. The file that nothing else in your toolchain flags, the one that hasn't changed in over a year, that half your codebase quietly depends on, whose sole author left eight months ago: that's the file your agent is about to break.