---
title: "Code is not text."
slug: code-is-not-text
date: "February 2026"
read_time: "4 min read"
word_count: 1247
excerpt: "Git is one of the greatest pieces of software ever written. It tracks text files brilliantly. But code has structure that text doesn't, and there's a lot you can do once your tools understand that."
filed_under: ["Version control","Code intelligence","Agent systems"]
---

# Code is not text.

Git is one of the greatest pieces of software ever written. The content-addressable object model, the DAG of commits, the branching system that makes parallel work feel natural. It solved distributed version control so thoroughly that nobody seriously tries to replace it anymore, and for good reason. When people complain about git, they're usually complaining about its CLI, not its architecture. The architecture is brilliant.

But git was designed to track text files, and source code, while stored as text, has structure that plain text doesn't. A Python file isn't just a sequence of lines. It's a collection of functions and classes, each with defined inputs and outputs, connected to other functions and classes in other files through import statements and function calls. Git doesn't know any of this, and it was never supposed to. That wasn't the problem git set out to solve. But it means there's a layer of understanding that's missing from the tools we use every day.

Consider what happens when a diff is expressed in lines versus entities. A typical file has L lines but only E entities, where E is much smaller than L, often by an order of magnitude. A file with 300 lines might contain 15 functions. When someone changes a few of those functions, git reports the diff in terms of L: you see some number of lines added, removed, or modified, grouped by file. But the actual semantic content of the change, the thing you need to understand in order to review it, is proportional to E.

**CLAIM 1.2.** The signal-to-noise ratio of a line-level diff is bounded above by E/L — for most source files, a small number.

> The ratio of signal to noise in a line-level diff is roughly *E / L* — and for most files that's a small number.
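
To make *L* and *E* concrete, here is a minimal sketch that counts both for one small Python file using the standard library's `ast` module. The sample source and the helper name `count_lines_and_entities` are illustrative assumptions, not part of any real tool. The toy file is far denser in entities than a typical module, so its ratio is much higher than the one described above.

```python
import ast

# Hypothetical sample file, used only to have something to measure.
SOURCE = '''\
import math


def area(radius):
    # Area of a circle.
    return math.pi * radius ** 2


def perimeter(radius):
    return 2 * math.pi * radius


class Circle:
    def __init__(self, radius):
        self.radius = radius

    def describe(self):
        return f"circle of radius {self.radius}"
'''

# What counts as an "entity" here: functions, methods, and classes.
ENTITY_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def count_lines_and_entities(source: str) -> tuple[int, int]:
    """Return (L, E): the non-blank line count and the entity count."""
    lines = sum(1 for line in source.splitlines() if line.strip())
    tree = ast.parse(source)
    entities = sum(isinstance(node, ENTITY_TYPES) for node in ast.walk(tree))
    return lines, entities


L, E = count_lines_and_entities(SOURCE)
print(f"L={L} E={E} E/L={E / L:.2f}")
```

Running the same function over real modules in your own codebase is a quick way to check how far *L* and *E* actually diverge there.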

## What is the semantic gap?

This is especially important for AI agents, and understanding why requires thinking about how agents process code. An agent reviewing a pull request pays a cost proportional to the number of tokens it has to read. Line-level diffs are expensive: every changed line, every context line, every reformatted line costs tokens. But the number of decisions the agent actually needs to make is proportional to E, the number of changed entities.

If you feed the agent a line diff, it spends most of its token budget on noise. If you feed it an entity diff, it spends almost all of its budget on signal. The difference isn't marginal. In a codebase where L/E is 20, you're asking the agent to do 20× the work for the same amount of understanding.

That's the gap we set out to fill with [sem](https://github.com/Ataraxy-Labs/sem). It sits on top of git and adds a layer that understands the structure of code. Instead of seeing a file as a sequence of lines, sem sees it as a collection of entities: functions, classes, methods.
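
To illustrate the idea, and not as a description of how sem itself is implemented, here is a rough sketch of an entity-level diff for Python built on the standard `ast` module. The function names `entities` and `entity_diff` are hypothetical. It compares AST dumps, so layout and comments are ignored, and it only looks at definitions nested directly inside a module, class, or function.

```python
import ast

ENTITY_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def entities(source: str) -> dict[str, str]:
    """Map each entity's qualified name to a dump of its AST."""
    found: dict[str, str] = {}

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ENTITY_TYPES):
                name = f"{prefix}{child.name}"
                # ast.dump omits line and column numbers by default,
                # so moving or reformatting an entity leaves it unchanged.
                found[name] = ast.dump(child)
                visit(child, f"{name}.")

    visit(ast.parse(source), "")
    return found


def entity_diff(old: str, new: str) -> dict[str, list[str]]:
    """Report which entities were added, removed, or modified."""
    before, after = entities(old), entities(new)
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(n for n in shared if before[n] != after[n]),
    }


OLD = '''
def add(a, b):
    return a + b

def greet(name):
    return "hello " + name
'''

NEW = '''
def add(a, b):
    # Reformatted and commented, same logic.
    return (a +
            b)

def greet(name):
    return "hi " + name

def farewell(name):
    return "bye " + name
'''

print(entity_diff(OLD, NEW))
```

In this example a line diff would flag `add` because of the comment and the rewrapped expression, while the entity diff reports only `greet` as modified and `farewell` as added. One simplification to be aware of: a class's dump includes its methods, so editing a method also marks the enclosing class as modified.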

## The dependency graph

Each entity gets a structural hash computed from its AST rather than its text, so two versions of a function that differ only in surface form, such as layout, comments, or local variable names, produce the same hash. Reformatting doesn't change the hash. Renaming a local variable doesn't change it. Adding a comment doesn't change it. Only changes to the actual logic register as changes.
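
One way such a hash could work, sketched here for Python with the standard `ast` and `hashlib` modules. This is an assumption about the general technique, not sem's actual algorithm, and the names `_LocalRenamer` and `structural_hash` are made up for the example. Comments and layout never reach the AST, and assigned local names are replaced with positional placeholders before hashing. Parameters are left alone because renaming them can change how callers pass keyword arguments.

```python
import ast
import hashlib


class _LocalRenamer(ast.NodeTransformer):
    """Rename assigned local variables to _v0, _v1, ... in visit order."""

    def __init__(self, local_names: set[str]) -> None:
        self.local_names = local_names
        self.mapping: dict[str, str] = {}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id in self.local_names:
            node.id = self.mapping.setdefault(node.id, f"_v{len(self.mapping)}")
        return node


def structural_hash(source: str) -> str:
    """Hash the first definition in `source` by structure, not by text."""
    func = ast.parse(source).body[0]
    params = {n.arg for n in ast.walk(func) if isinstance(n, ast.arg)}
    stored = {
        n.id
        for n in ast.walk(func)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    }
    # Simplification: ignores `global` and `nonlocal` declarations.
    normalized = _LocalRenamer(stored - params).visit(func)
    return hashlib.sha256(ast.dump(normalized).encode()).hexdigest()


ORIGINAL = '''
def total(prices, tax):
    subtotal = sum(prices)
    return subtotal * (1 + tax)
'''

COSMETIC = '''
def total(prices, tax):
    # Sum first, then apply tax.
    s = sum(prices)
    return s * (
        1 + tax
    )
'''

LOGIC_CHANGE = '''
def total(prices, tax):
    subtotal = sum(prices)
    return subtotal * (1 - tax)
'''

print(structural_hash(ORIGINAL) == structural_hash(COSMETIC))
print(structural_hash(ORIGINAL) == structural_hash(LOGIC_CHANGE))
```

The cosmetic variant adds a comment, renames a local, and rewraps an expression, yet hashes identically to the original. Flipping `+` to `-` changes the hash. A production version would need real scope analysis, which this sketch skips.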

> An agent with the entity graph focuses on *E + D*, the changed entities plus their dependents. Everything else is noise the compiler has already answered.

But the most important thing sem adds, especially for agents, is the dependency graph. Because sem understands entities, it can build a cross-file graph of which functions call which other functions, across the entire codebase. And once you have that graph, you can answer a question that no amount of LLM reasoning can reliably answer: if I change this function, what else might break?
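
The "what else might break" question comes down to a walk over reversed call edges. Below is a minimal sketch under loud assumptions: the two in-memory modules are invented, calls are matched by bare function name only (no import resolution, methods, or dynamic dispatch), and `build_dependents` and `impacted_by` are hypothetical helpers, not sem's API.

```python
import ast
from collections import defaultdict, deque

# Hypothetical two-file codebase, kept in memory for the example.
MODULES = {
    "pricing.py": '''
def apply_tax(amount, rate):
    return amount * (1 + rate)

def total(prices, rate):
    return apply_tax(sum(prices), rate)
''',
    "checkout.py": '''
from pricing import total

def checkout(cart):
    return total(cart, 0.2)

def banner():
    return "welcome"
''',
}


def build_dependents(modules: dict[str, str]) -> dict[str, set[str]]:
    """Return callee -> callers, matching calls by bare function name."""
    trees = [ast.parse(src) for src in modules.values()]
    defined = {
        node.name
        for tree in trees
        for node in ast.walk(tree)
        if isinstance(node, ast.FunctionDef)
    }
    dependents: dict[str, set[str]] = defaultdict(set)
    for tree in trees:
        for func in ast.walk(tree):
            if not isinstance(func, ast.FunctionDef):
                continue
            for node in ast.walk(func):
                if (
                    isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in defined
                ):
                    dependents[node.func.id].add(func.name)
    return dependents


def impacted_by(changed: str, dependents: dict[str, set[str]]) -> set[str]:
    """Breadth-first walk over reverse call edges from `changed`."""
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


graph = build_dependents(MODULES)
print(sorted(impacted_by("apply_tax", graph)))
```

Changing `apply_tax` reaches `total` directly and `checkout` transitively, across the file boundary, while `banner` stays out of the impact set. The point of the sketch is that the answer comes from a graph traversal rather than from guessing based on the diff text.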

We originally built sem because we needed exactly this for our own agents. But it turns out that what agents need and what humans need are the same thing, and probably always have been. Both agents and humans have limited attention. Both want to know what changed, whether it's a real change or cosmetic, and what it affects. Code has always had structure. Our version control tools have always tracked text. There's room for a layer in between that bridges that gap.
