Friday, 3 April 2026

The Great Claude Code Leak: Under the Hood of Claude Code, Codex, and Gemini

The accidental leak of the Claude Code source code on April 1, 2026, has provided an unprecedented look into Anthropic's agentic architecture. With thousands of mirrors now circulating online, the industry has a rare opportunity to analyze the prompt design decisions and tool-use frameworks that power high-end coding agents. This is the ideal moment for a comparative study of how leading AI companies structure their internal developer workflows.

What the system prompts of Codex CLI, Gemini CLI, and Claude Code reveal about each team's theory of AI reliability — and what that means if you're building agents yourself.



Links to the full prompts appear at the end of each section, if you are eager to read those first.



Every system prompt is a natural language program — a list of instructions that encodes how an AI agent becomes reliable.

When OpenAI, Google, and Anthropic each built their flagship coding CLI tools, they made the same bet differently: that there exists a root cause for agent failure, and that the right prompt addresses it at the root.

Reading the published system prompt structures for Codex CLI, Gemini CLI, and Claude Code side by side, what emerges is not a feature comparison. It is three distinct philosophies of control.

OpenAI says: give the model a coherent identity and it will make coherent decisions. 

Google says: give the model explicit operational procedures and the decisions follow. 

Anthropic says: enumerate what the model must never do and the safety boundary itself becomes the guarantee.

Every company building on top of these models will face the same architectural choice. Understanding what the frontier labs chose — and why — is a prerequisite for making that choice well.

Identity, Process, and Constraint as Design Primitives

Codex CLI's prompt is dominated by persona construction. Personality, values, interaction style, escalation behavior — the overwhelming share of prompt surface area is spent answering: who is this agent? The implicit theory is that a model with a coherent, well-specified identity will produce coherent behavior by inference. Tell the model it is pragmatic, rigorous, and respectful; that it values clarity over cleverness; that it should challenge bad requirements rather than silently comply — and the specific behaviors emerge from that character.

Gemini CLI takes the opposite approach. The prompt allocates most of its weight to operational procedures: context efficiency strategies, search-and-read patterns, development lifecycle phases (Research → Strategy → Execution), sub-agent orchestration instructions. The model's identity is thin. The workflow is thick. The implicit theory is that reliable outputs come from constraining the action space rather than shaping the decision-making self.

Claude Code occupies a different axis entirely. The heaviest sections are not about who the agent is, nor about how it should work — they are about what it must not do. Blast radius. Reversibility. No destructive operations. Explicit OWASP threat categories. The theory here is that agent reliability is a negative property: an agent is trustworthy to the degree that it cannot cause harm, not to the degree that it has good values or follows good procedures.




OpenAI Bets on Identity




The Codex CLI prompt reads less like an instruction manual and more like a character sheet for a fictional software engineer. It specifies personality traits (pragmatic, communicative), professional values (clarity, rigor), and crucially — an escalation philosophy. The agent is explicitly told when to push back: when it detects a bad tradeoff, when requirements seem underspecified, when the pragmatic path diverges from the literal ask.

This is the most sophisticated model of human collaboration in any of the three tools. Most agent prompts tell the model what to do. Codex tells it when to refuse, and how. That is a fundamentally different relationship with the user — it treats the engineer as a peer whose judgment can be wrong, not as a principal whose instructions are commands.



The escalation section — "challenge, pragmatic mindset, tradeoff" — is load-bearing in a way that is easy to miss. It encodes a theory of collaboration: the agent's job is not to execute instructions but to contribute judgment. This is what separates a coding tool from a coding collaborator. OpenAI made that choice deliberately, and it is visible in the prompt structure.


There is a notable anomaly in the Codex prompt: the frontend tasks section, which specifically mentions bold choices, surprising colors, and visual creativity. For a CLI tool targeting professional engineers, this is unusual. It suggests one of two things: either OpenAI designed Codex for a broader creative audience than the command line implies, or the frontend callout reflects the team's belief that creative judgment — not just technical execution — is a property the agent should possess by default.

The editing constraints are instructive in their specificity. Don't amend commits. Apply patches rather than rewrites. Maintain good code comments. These are not general principles — they are the learned lessons of a team that watched models cause damage in codebases and back-encoded those failure modes into the prompt.

Full Prompt is available @ Codex System Prompt

Google Bets on Process




Where Codex builds a person, Gemini CLI builds a workflow. The prompt is structured around phases and patterns: how to search efficiently, how to read large codebases without exhausting context, when to spawn sub-agents, how the development lifecycle should flow from research through strategy to execution. Identity is thin. The word "pragmatic" does not appear. What appears instead is an explicit context budget awareness that no other tool's prompt contains.

The "Context Efficiency" section — strategic tool use, estimated context usage — is the tell. This is an infrastructure concern bleeding into the prompt layer. Google is aware that Gemini's context, however large, is a finite and expensive resource, and they have encoded context management as a first-class concern for the agent itself. The model is being asked to reason about its own resource consumption in real time.
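The idea of an agent budgeting its own context can be sketched concretely. This is a hypothetical illustration, not Gemini CLI's actual mechanism: the 4-characters-per-token heuristic and the budget constants are assumptions.

```python
# Hypothetical sketch of an agent estimating its own context usage before
# reading a file, in the spirit of Gemini CLI's "Context Efficiency" rules.
# The 4-chars-per-token heuristic and the budget numbers are illustrative
# assumptions, not values from the actual prompt.

import os

CONTEXT_BUDGET_TOKENS = 100_000   # assumed total context budget
RESERVE_FOR_RESPONSE = 8_000      # assumed head-room kept for the reply

def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English and code."""
    return len(text) // 4

def can_afford_read(path: str, used_tokens: int) -> bool:
    """Check whether reading a whole file fits in the remaining budget."""
    size_tokens = os.path.getsize(path) // 4
    remaining = CONTEXT_BUDGET_TOKENS - RESERVE_FOR_RESPONSE - used_tokens
    return size_tokens <= remaining
```

An agent with this check would fall back to targeted search-and-read patterns rather than whole-file reads once the budget gets tight, which is the behavior the prompt section asks for.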


When a company encodes "estimate context usage" into an agent's operating principles, it is admitting something: context window economics are not solved at the infrastructure layer, so they are being delegated to the agent layer. This is a runtime concern being pushed into the prompt. It is not elegant, but it is honest.

The Development Lifecycle section — Research → Strategy → Execution — is the most ambitious design choice in any of the three prompts. It tries to impose a thinking structure on the model: don't execute before you understand, don't implement before you have a strategy. Most tools treat the agent as reactive; Gemini CLI tries to make it deliberate. Whether a model actually follows this structure in practice is a different question. As a design intention, it is the clearest signal that Google is trying to build a thinking partner rather than a code-generation endpoint.
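As a sketch of what that lifecycle might look like mechanically (the phase names come from the prompt; the loop, function names, and logging are my own assumptions):

```python
# A minimal sketch of a phased agent loop in the Research -> Strategy ->
# Execution shape that Gemini CLI's prompt prescribes. The phase names come
# from the prompt; the dispatch structure and function names are invented.

from enum import Enum, auto

class Phase(Enum):
    RESEARCH = auto()
    STRATEGY = auto()
    EXECUTION = auto()

def run_task(task: str) -> list[str]:
    """Walk a task through the three phases in order, collecting a log."""
    log = []
    for phase in Phase:  # Enum iterates in definition order
        log.append(f"{phase.name}: {task}")
        # A real agent would call tools here (search, plan, edit) and
        # refuse to advance until the current phase's output is accepted.
    return log
```

The point of the structure is the refusal to skip ahead: execution code paths are simply unreachable until the earlier phases have produced something.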

The sub-agents section is equally revealing. Gemini CLI explicitly models itself as an orchestrator: codebase investigation, CLI help, and generalist tasks are treated as separable concerns that can be delegated to specialized sub-agents. This is an architectural declaration — that the right model of AI-assisted development is multi-agent, not monolithic, and the prompt structure should reflect that from the start.

Full Prompt is available @ Gemini System Prompt

Anthropic Bets on Constraint



Claude Code's prompt has a different texture from the other two. It is not warmer or colder — it is more cautious in its diction. The language of the operations sections borrows from risk management: blast radius, reversibility, local change scope, no destructive operations. These are not metaphors. They are explicit categories that the agent is meant to evaluate before acting. The implicit model is that every action the agent takes should be assessed for its damage potential before execution, not after.
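A toy version of such a pre-action gate might look like this. The vocabulary (destructive operations, reversibility, blast radius) echoes the prompt; the scoring rules, thresholds, and command list are invented for illustration:

```python
# Illustrative pre-action gate in the spirit of Claude Code's blast-radius
# and reversibility vocabulary. The categories echo the prompt; the rules,
# thresholds, and command list below are assumptions made for this sketch.

DESTRUCTIVE_COMMANDS = {"rm", "drop", "truncate", "force-push"}

def assess_action(command: str, reversible: bool, files_touched: int) -> str:
    """Return 'deny', 'confirm', or 'allow' for a proposed action."""
    verb = command.split()[0] if command else ""
    if verb in DESTRUCTIVE_COMMANDS and not reversible:
        return "deny"      # irreversible and destructive: never auto-run
    if not reversible or files_touched > 10:
        return "confirm"   # large blast radius: ask the user first
    return "allow"         # small, reversible change: proceed
```

The interesting property is the default: anything that fails the checks degrades to asking permission, not to acting.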

The capitalized IMPORTANT section — for security and URLs — is itself a prompt engineering technique, not merely a content category. Anthropic knows that models attend to capitalization and structural salience. Labeling a section IMPORTANT is a way of increasing the probability that the model treats its contents as non-negotiable rather than advisory. This is a team that knows how the sausage is made, and they are using that knowledge inside the prompt itself.



No other tool's prompt contains the phrase "blast radius." The use of weapons-of-war language for file operations is not accidental. It encodes a severity calibration: deleting the wrong file is not an inconvenience to be apologized for, it is a detonation. The vocabulary shapes how the model weights consequences, not just which actions it permits.


The security vulnerabilities section is the most technically specific of any prompt section across all three tools. Command injection, XSS, SQL injection, OWASP Top 10.  Anthropic is not asking the agent to "be security-conscious." They are naming threat classes and expecting the agent to recognize them in context. The implicit assumption is that a model trained on enough security literature can pattern-match against named vulnerabilities in real code, and the prompt's job is to activate that capability rather than describe it from scratch.

The Compressed Conversation section — handling context limits and context window overflow — is an admission that long-running agentic sessions will hit memory boundaries, and the agent needs a recovery behavior rather than silent degradation. This is operational visibility: the prompt accounts for the session not fitting in the window, which is a runtime failure mode that most prompts ignore entirely.

Full Prompt is available @ Claude Code System Prompt


What the Surface Area Reveals



Three Design Choices You Should Consider

If you are building an AI product that involves an agent taking actions — writing code, modifying files, calling APIs — these three prompts are good reference implementations. They are proofs of three different product bets, each with predictable failure modes.

The identity approach fails gracefully in ambiguous situations but fails badly at the capability ceiling. A model with a well-specified persona makes sensible judgment calls when the instructions run out. But persona is not a substitute for operational procedure in repetitive, high-stakes workflows. When the agent needs to search a large codebase efficiently, knowing it is "pragmatic" does not help. You need the grep patterns.




For most AI products, the right prompt architecture layers all three: a thin identity layer to establish tone and judgment defaults, a procedure layer for the high-frequency operational paths, and a constraint layer for the actions where failure is not recoverable. The mistake is choosing one and applying it universally. Each layer serves a different failure mode.
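A minimal sketch of that layered composition, with all of the layer text invented for illustration:

```python
# Sketch of a three-layer prompt architecture: thin identity, operational
# procedures, hard constraints, concatenated into one system prompt.
# All of the layer text below is illustrative, not from any real product.

IDENTITY = "You are a pragmatic, rigorous coding assistant."

PROCEDURES = """\
Workflow:
1. Read the relevant files before editing.
2. Propose a plan before any multi-file change."""

CONSTRAINTS = """\
IMPORTANT:
- Never run destructive commands without explicit confirmation.
- Never expose secrets or credentials in output."""

def build_system_prompt() -> str:
    """Layer identity, procedure, and constraint sections in order."""
    return "\n\n".join([IDENTITY, PROCEDURES, CONSTRAINTS])
```

Keeping the layers as separate constants also makes them independently testable and versionable, which matters once the prompt starts accumulating incident-driven rules.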

The process approach fails at novel tasks. If the agent's workflow is Research → Strategy → Execution, and the user asks for something that doesn't fit that shape, the agent either forces the task into the wrong template or falls back to undefined behavior. Procedures are brittle at their boundaries. This is the same critique Rich Hickey makes of complected code — when the procedure and the judgment are tangled, changing one breaks the other.

The constraint approach fails at capability, by design. An agent that is maximally conservative about blast radius, reversibility, and destructive operations will refuse or seek permission at the moments when an experienced engineer would just act. The safety guarantee comes with a throughput cost. For consumer-facing products, this is the right trade. For developer tools used by people who understand the risk, it may be too conservative.

One structural observation cuts across all three: none of these prompts is static, and many instructions are added at run time.

The specificity of Codex's editing constraints, Gemini's context efficiency instructions, and Claude Code's OWASP threat categories all bear the fingerprints of post-hoc repair — lessons learned from watching models fail in production, back-encoded into the prompt. The prompt is not a design document. It is a running incident log, formatted as instructions.

Every overly specific rule is a failure that happened once.

If you want to understand what problems a team has actually encountered with their agent, read the most specific sections of their system prompt. The level of specificity is directly proportional to the pain the team faced while building the tool.

So what is the story for each model?



The prompts are archives of expensive mistakes, and reading them carefully is the cheapest form of safety research available.



Thursday, 2 April 2026

AI Engineering Terms You Will Memorize and Then Forget

The LLM era has a reliable product cycle: someone coins a term, someone more famous endorses it, the internet credits the endorser, and LinkedIn does the rest.



The cycle is embarrassingly simple.



We have done this at least three times in four years. Each time with complete sincerity. Each time with worse naming.

The progression is instructive. Prompt engineering at least had the decency to describe what you were actually doing — writing prompts. 

Context engineering was already stretching it; the word "engineering" is doing considerable structural load-bearing for what Shopify CEO Tobi Lutke accurately described on June 18, 2025 as "the art of providing all the context for the task to be plausibly solvable by the LLM."

One week later, Andrej Karpathy endorsed the term with a longer explanation. 

This is the naming mythology rewriting itself in real time, which is either poetic or depressing depending on your tolerance for how information actually spreads.

Harness engineering arrived in February 2026, formalized by Mitchell Hashimoto and documented by an OpenAI team who used it to describe building a million lines of agent-generated code. The OpenAI post is worth reading carefully, because what it actually describes — at length, with genuine insight — is: write config files, enforce architectural patterns with linters, and run cleanup jobs to remove code the agents wrote badly. In previous decades we called this maintenance. We did not issue certifications for it.

A brief taxonomy, in ascending order of naming audacity

Prompt engineering (2020 onward). You wrote better prompts. The model got better and started understanding worse prompts. The practice quietly became "just describe what you want" and nobody held a funeral. Peak era: elaborate system prompts explaining that the model was a helpful assistant who should be honest. The model already knew. The prompt was load-bearing only for the engineer's confidence.

Context engineering (June 2025). Tobi Lutke named it. Karpathy amplified it. The internet misattributed it. The underlying activity — deciding what information the model needs to do its job — predates LLMs by however long humans have been briefing other humans before asking them to do things. The new contribution was giving that activity a name that sounds like it requires a degree. Within weeks, "context engineering" had a LangChain blog post, a Hugging Face explainer, an Anthropic guide, and a GitHub repository with a biological metaphor that progressed from atoms to neural fields. The speed from named to over-explained remains a record.

Harness engineering (February 2026). The OpenAI post describes three engineers spending every Friday cleaning up what they called "AI slop" before they automated that cleanup into a recurring agent task. This is a genuinely useful observation. It is not a new discipline of engineering. It is a description of what happens when you run software in production and things go wrong, which has been happening since software existed in production.


AI slop is low-quality, mass-produced digital content generated by artificial intelligence that lacks human effort, meaning, or artistic value.



Why this keeps happening, and who benefits

The naming game is not accidental, and it is not innocent. A new term does three things simultaneously: it creates a hiring category, it creates a product category, and it creates a reason to hold a conference. All three monetize faster than the underlying practice matures.



Vendors benefit most cleanly. The company whose engineer names the discipline owns the default tooling search. OpenAI documented harness engineering using Codex. OpenAI sells Codex. The educational content is the distribution strategy wearing a lab coat. 

It is just how technology markets work. The observation worth making is that it works every single time, on an accelerating schedule, with no apparent ceiling on how quickly a named practice can generate a certification program.

Engineers benefit more ambiguously. A new title justifies a salary band and provides a legible identity in a market where "I work with AI" is too broad to be useful. 

The cost is that the identity couples to practices with an expiry date. 

The prompt engineer of 2022 discovered this. The context engineer of 2025 is discovering it now. The harness engineer of 2026 has perhaps eighteen months before the runtime absorbs the harness and the job title requires a new noun.


Every AI engineering term names a gap between what the model can currently do and what the business currently needs. The gap is real. The engineering discipline named after it is temporary. These are compatible facts that the certification market prefers not to foreground.





Predictions — clearly labeled as such

The following are forward-looking extrapolations that are likely to emerge soon. I want to present them early so that if they catch the attention of influential voices or gain traction publicly, they can spark a self-reinforcing cycle — from hiring categories to product ecosystems to conferences — ultimately unlocking significant value. And who knows, it might even make me a bit famous.






The next term will probably be verification engineering, describing the practice of checking whether agent output is correct before it causes a problem in production. This is currently called testing. It will get renamed when a sufficiently publicized production failure is traced to inadequate output validation from an autonomous agent, and when that failure generates enough LinkedIn posts to constitute a discourse.

After that, something like decomposition engineering — the practice of breaking high-level goals into units of work that agents can handle without producing nonsense. The OpenAI harness post describes this as their team's primary job: working "depth-first, breaking down larger goals into smaller building blocks." 

This activity already has a name in project management. It will get a new name when someone publishes a paper showing that agent output quality correlates more strongly with task decomposition quality than with model selection. The paper will be correct. The naming will still be funny.

At some point the industry will notice that the "prior art" column and the "AI engineering term" column describe the same activities, and software engineering — which has existed since the 1960s — will be quietly declared to have been context-harness-verification engineering all along. A retrospective blog post will be written. It will get many LinkedIn reposts.


What actually transfers

Why did I write this satire?

Building reliable systems around non-deterministic components is genuinely hard. The existing vocabulary of software engineering does not perfectly fit — a prompt is not a function, a context window is not a database, an agent failure is not a standard exception. The gap in vocabulary is real, and new terms can be useful even when the underlying activity is old.

The mistake is not naming practices. The mistake is mistaking the name for the skill. The engineer who understands why a system should behave a certain way — and can specify, verify, and maintain that behavior across model versions — will be fine regardless of what the current term is. The engineer who has mastered the current term and little else will be retraining when the next model ships.

The models will keep improving. The limitations will keep shifting. The terms will keep coming. And somewhere, a course will always be ready before the ink on the blog post is dry.



Friday, 27 March 2026

Similarity is not Relevance

There is a subtle confusion baked into every LLM-powered system in production today, and it is responsible for a larger fraction of failures than most teams realize. The confusion is this: we have built systems optimized for similarity, and we have shipped them as if they deliver relevance. They do not, and the difference is not academic.

Similarity is a geometric property. Two things are similar when they are close to each other in some metric space — cosine distance between embeddings, edit distance between strings, perplexity under a language model. It is computable, differentiable, and entirely indifferent to purpose. Relevance, by contrast, is teleological. Something is relevant if it advances a goal, reduces uncertainty, or changes what you should do next. Relevance is defined relative to an intention. Similarity is blind to intention.
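The geometric half of that distinction is easy to make concrete. Cosine similarity is a pure function of two vectors; nothing in it encodes a goal. The toy vectors below are invented for illustration:

```python
# Cosine similarity as the text defines it: a purely geometric measure
# over vectors, with no notion of the user's intention. The toy 3-d
# "embeddings" here are hand-made, not real model outputs.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Two vectors pointing the same way are maximally similar...
assert cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]) > 0.999
# ...whether or not either one is relevant to what you are trying to do.
```

Note what the function's signature admits: it takes two vectors and nothing else. There is no parameter for the task, the goal, or the deployment context, which is the whole argument of this section in one line of code.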

Every major component of modern LLM stacks — retrieval, generation, alignment — is built on similarity. When they fail, they fail for the same reason: they found what was close, not what was needed.

The model is always correct about what is similar. It has no native mechanism for knowing what is needed.


Similarity ≠ Relevance








The autocomplete that always answers the wrong question


Consider a developer working on a distributed payment service. She types a function signature for retry logic with exponential backoff and asks the coding assistant to complete it. The assistant produces a clean, syntactically valid implementation — well-formatted, documented, handling the common cases. It looks exactly like the retry logic that appears in ten thousand open-source repositories.

What the assistant has done is retrieve and synthesize code that is maximally similar to retry logic in its training distribution. What the developer needed was retry logic that respects the idempotency contract already established elsewhere in the codebase, coordinates with the circuit-breaker state that her colleague committed last week, and avoids the cascading retry storm that their incident review identified two sprints ago. None of that information lives in the local similarity neighborhood of "exponential backoff implementation."

The assistant solved the problem it could measure. It optimized for syntactic and semantic proximity to known good code. But relevance in this context is defined by the surrounding system — the architecture decisions, the failure post-mortems, the implicit contracts between services. These are irreducibly contextual. They do not compress into an embedding.


The deeper problem is that similarity-based retrieval actively misleads by presenting confident outputs. A retrieved chunk with a cosine similarity of 0.91 feels authoritative. The developer accepts it, integrates it, and the failure surfaces in production three months later — not as an obvious crash, but as a subtle degradation under specific load patterns.

The similarity score was high; the relevance was near zero.



Fancy Words, Empty Head

Email generation is where the similarity-relevance gap is most visible and least discussed, because the outputs feel so undeniably correct. 

You ask the model to draft a follow-up email to a client who missed a deadline. It produces something professional, appropriately apologetic, clear in its next-step request, and tonally calibrated to business correspondence. 

Every sentence resembles what a senior professional would write in this situation.

But "resembles" is exactly the problem. 

The model has matched the surface pattern of the email genre. It does not know that this particular client is two weeks from the end of their annual contract and the conversation has been quietly tense since a pricing dispute in Q3. 

It does not know that the missed deadline was likely caused by a restructuring on their side that your account manager mentioned in passing on Slack. 

It does not know that a direct ask for a new timeline would land badly right now, while an offer of support would open the door. The relevant email is defined by that relational history, not by similarity to the genre of follow-up emails.

The model produces text that is similar in form to a good email. It has no mechanism for knowing whether the email is good for this situation.


What gets produced is a document that would earn an A in a business writing course and accomplish nothing — or worse, accelerate a deteriorating relationship by applying a generic professional register to a moment that required something specific. The failure is invisible because the output is fluent. Fluency is a similarity property. 

It measures proximity to well-formed text. It says nothing about whether the text does the right work in the right moment.

This is where RLHF compounds the problem. Human raters, presented with the email during training, reward it — because it looks like good professional writing. 

The model is trained to produce outputs that humans rate as high quality in isolated evaluation. But isolated evaluation cannot capture relational context. 

The model gets better at producing emails that resemble good emails. The gap between resemblance and genuine utility quietly widens.


When grouping by proximity destroys meaning


Clustering is the case that most directly exposes the architectural assumption underneath the whole stack. When you cluster documents, support tickets, or customer feedback using LLM embeddings, you are grouping by geometric proximity in the embedding space. The algorithm puts similar things together. This is, on its face, exactly what clustering should do.

Except that the purpose of clustering is never geometry. The purpose is always analytical — you are trying to understand the structure of a problem, identify actionable segments, or surface patterns that inform a decision. And those analytical goals define what "same group" means, independently of what "similar text" means.

A support ticket that reads "the dashboard is slow" and a ticket that reads "the API is timing out" might be semantically distant in embedding space — different vocabulary, different technical register, different surface description. But if both are caused by the same database query bottleneck, they belong in the same bucket for the engineering team. Conversely, two tickets that both say "I can't log in" might be superficially identical but one is a password reset issue and one is an account suspension, and routing them to the same team is actively harmful.
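The ticket example can be made concrete with a toy grouping. The tickets and cause labels below are invented; a real pipeline would get cause labels from incident data, which is exactly the information an embedding of the ticket text lacks:

```python
# Toy illustration of the ticket example: grouping by surface text versus
# grouping by root cause. Tickets and cause labels are invented; in a real
# system the cause would come from incident data, not from the ticket text.

from collections import defaultdict

tickets = [
    {"text": "the dashboard is slow", "cause": "db_bottleneck"},
    {"text": "the API is timing out", "cause": "db_bottleneck"},
    {"text": "I can't log in", "cause": "password_reset"},
    {"text": "I can't log in", "cause": "account_suspended"},
]

def cluster_by(key: str) -> dict[str, list[str]]:
    """Group ticket texts by the given field ('text' or 'cause')."""
    groups = defaultdict(list)
    for t in tickets:
        groups[t[key]].append(t["text"])
    return dict(groups)

# Surface clustering merges the two unrelated login tickets into one group;
# cause clustering puts both database tickets in one actionable bucket.
surface_clusters = cluster_by("text")
cause_clusters = cluster_by("cause")
```

Embedding-based clustering approximates the first grouping; the engineering team needs the second. The gap between the two is the similarity-relevance gap in miniature.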



Similarity clusters by surface. Relevance clusters by cause. The right grouping depends on what you intend to do with the groups.


The geometry of the embedding space does not know that your goal is actionable routing. It knows word co-occurrence patterns. Sometimes those align. When the stakes are low, the alignment is good enough. When you are making resource allocation decisions, prioritizing engineering work, or segmenting customers for intervention, the gap between what is similar and what is relevant determines whether the analysis was worth running.

The seductive part is that similarity-based clusters look coherent. The topics within each cluster feel related. The outputs pass a plausibility check. But plausibility is another similarity property — it measures whether the output resembles something true. It does not measure whether the groupings actually serve the analytical purpose for which the clustering was run.




Three faces, one failure

Across all three cases — the code that fits the genre but breaks the system, the email that sounds right but says the wrong thing, the clusters that are coherent but not actionable — the failure has identical structure. 

The model or the pipeline optimized for a measurable proxy (syntactic similarity, surface fluency, geometric proximity) and produced an output that scores well on that proxy. 

The proxy and the goal coincided in the average training case. They diverged in the specific deployment case. The system had no way to detect the divergence.

This is not a hallucination problem. The outputs in all three cases can be entirely accurate in a narrow sense. The code is syntactically correct. The email is factually unobjectionable. The clusters are internally coherent. The failure is not falseness — it is misalignment between what was optimized and what was needed.


What this means practically is that the verification burden sits entirely with the human in the loop. Every LLM output comes pre-packaged with high confidence and fluent presentation — both similarity properties — and zero signal about whether it is relevant to the specific situation at hand. The engineer must know the system well enough to see past the fluent implementation. The account manager must know the client well enough to see past the professional tone. The analyst must know the business well enough to see past the coherent clusters. The AI provides the shape of an answer. Relevance is still a human judgment.


Conclusion

None of this means the tools are not useful. Similarity to good outputs is a genuinely valuable prior. 

A coding assistant that produces implementations similar to idiomatic, working code accelerates the developer who knows the system. 

An email assistant that produces text similar to professional correspondence accelerates the writer who knows the relationship.

The similarity machinery handles the generic, leaving the expert to handle the specific.

The error is the frame — treating outputs that scored high on similarity as if they had been evaluated for relevance. 

They have not been. They cannot be, because relevance requires the deployment context that was absent at training time. 

The model is excellent at finding what is close. Determining whether what is close is what is needed remains, stubbornly, a problem for the human who knows what is needed.

Confusion about this distinction is a major issue. It is the source of an entire category of quiet, confident, professionally-formatted failures.


Sunday, 22 March 2026

Induced Demand Loop: Anthropic Sells You the Problem, Then the Solution

Anthropic built Claude Code to write your software, and they have done an excellent job of making it the most preferred agentic coding tool. It helps you generate the best code on the first pass, or with shorter loops.

Now it sells Claude to review what Claude wrote. The snake has found its tail — and this is not an accident.


There is a pattern in business history that feels, the first time you notice it, like a conspiracy. A company creates a category of problem, then creates the solution, then collects rent from the gap between the two. 

Security consultancies who audited the systems they also architected. 

ERP vendors who sold implementation services for the complexity they introduced. 

Management consultants who institutionalized the inefficiencies they were paid to eliminate.

The AI era has produced its own version of this. It is more elegant than the historical ones — structurally self-reinforcing in a way the older models could only approximate. And Anthropic, with the quiet launch of code review as a product category following Claude Code, has demonstrated the loop with unusual clarity.

First, They Shipped the Generator

Claude Code is, at its core, an autonomous coding agent. It reads your codebase, writes implementations, refactors modules, scaffolds tests, and submits pull requests with the confidence of a senior engineer who has never experienced the social cost of a bad review. It is fast, tireless, and cheap. It is also — and this matters — statistically wrong in ways that are difficult to detect without reading every line it produces.



The product was sold, correctly, as a productivity multiplier. The pitch was straightforward: software engineering is bottlenecked on implementation speed, and Claude Code removes that bottleneck. Ship faster. Do more with fewer engineers. The implementation is no longer the hard part.

What this framing quietly omitted was the second-order effect. If you remove the implementation bottleneck, you do not get the same system running faster — you get a different system running under entirely new constraints. The bottleneck shifts. And the new bottleneck, almost inevitably, is verification.


The speed of generation outpaces the speed of comprehension. Code review was already the slowest lane on the engineering highway. Claude Code just added ten more lanes of traffic.

Every line that Claude Code writes must be read by someone who understands it well enough to sign off on it. That person is, in most organizations, increasingly rare. 

The engineers who remain after a round of AI-enabled headcount reduction are the ones reviewing output, not producing it. They were already stretched. Now they are reviewing five times as much code per day. Quality degrades. Bugs ship. Technical debt accumulates at the speed of token generation.


Then, They Shipped the Reviewer

The code review product is the second half of the loop. It reads the code — implicitly, the code that Claude Code wrote — and identifies issues, suggests improvements, flags security concerns, enforces architectural consistency. It is, in essence, an AI that reviews the output of a different AI trained by the same company, sold to the same customer, billed on the same invoice.

The symmetry is so clean it almost obscures the mechanism. But the mechanism is precise: Claude Code created the supply of unreviewed code. Code review created the demand for reviewing it. The company captures value on both ends of the transaction. The customer pays twice for a problem they did not have before they adopted the first product.


The Pattern, Precisely

This is not identical to the older consulting-firm model, where the problem was manufactured through advice. Here, the problem is an emergent property of the product itself. Claude Code does not intend to create review debt — it simply does, structurally, as a consequence of its own efficiency. It is the rational response to a real problem. The fact that the same company profits from both sides is not malfeasance. It is alignment.



This is what I call the induced demand pattern — AI tools that structurally generate the conditions for their own expansion. The code generation category is the clearest instance yet. Generate more code, create more review surface, sell more review tooling, use that revenue to train better generation models, which generate more code. The loop is not just self-sustaining. It is self-accelerating.


Why the Snake Eats Its Own Tail

The ancient image of a serpent consuming itself, the ouroboros, was originally a symbol of cyclical renewal. The snake does not die; it feeds itself, perpetually. This is an accurate metaphor for what Anthropic has constructed.

The model that reviews the code learns from what it reviews. The patterns it flags become training signal for the model that writes the code next time. The review product improves the generation product, which increases the volume of code requiring review, which expands the market for the review product. There is no exterior — no part of this loop that does not feed back into the loop itself.






Compare this to the classical tech platform flywheel, where more users attract more sellers who attract more users. That loop is linear in its dependencies — it requires external participants at every node. The AI coding loop is tighter. The only external participant is the engineer, and even the engineer's role is progressively compressed as each generation of the model improves. The loop internalizes its own demand generation.


Implication for Engineers

The engineer who adopts Claude Code and then adopts the code review product has not automated away two separate problems. They have enrolled in a subscription to a problem-solution pair that is jointly managed by a vendor whose revenue depends on both sides of it remaining necessary. This is not a reason to reject the tools — the productivity gains are real, and the competitive pressure to adopt them is overwhelming. But it is a reason to be precise about what is actually happening.

The skills that used to be valuable in this workflow — the ability to write clean code quickly, to hold an architectural pattern in your head while implementing it — are being hollowed out from below. The skills that survive this compression are the ones at the top of the evaluation chain: the ability to read code written by someone else (or something else) and judge it accurately. The ability to know what a correct system feels like before you have built it. The ability to detect subtle errors in logic that no statistical model will flag because no statistical model has ever understood what the code is supposed to do.


The review product is not your ally in this dynamic. It is a product that profits most when the gap between what gets generated and what is actually correct remains large enough to require continuous attention.

This is the tension that no product announcement will name directly. Code review tooling, like all automated verification, has an incentive structure that is subtly misaligned with actually closing the verification gap. 

A perfect reviewer would put itself out of business. A profitable reviewer finds just enough to flag that you keep paying — while the deeper architectural drift, the slow divergence between what the system does and what it should do, accumulates beneath the surface of any automated check.


What the Pattern Predicts

If the induced demand pattern holds — and structurally, I believe it will — the next several years of AI developer tooling will follow a predictable shape. Every tool that accelerates a phase of the engineering lifecycle will create a corresponding tool that manages the debt that acceleration produces. Test generation will be followed by test quality analysis. Documentation generation will be followed by documentation accuracy verification. Architecture suggestion will be followed by architecture review.

Each pair will be sold by the same vendors, or by vendors whose incentives are structurally identical. Each pair will be presented as the solution to a problem, while quietly sustaining the conditions that make the problem recur. The stack will grow upward, each layer extracting value from the gap created by the layer below it.

The engineers who navigate this without becoming permanently dependent on it are the ones who maintain a clear model of what the system is supposed to do — not just what it currently does. That model is not a product. It cannot be sold, automated, or subscribed to. It is built slowly, through exposure to consequences, through the experience of being wrong in ways that matter and learning why.

Judgment compounds. Skills depreciate.

Human Judgment as a Cloud Function

Anthropic is not cynically manufacturing problems. The induced demand here is emergent, not engineered. But emergent does not mean neutral. The structure rewards continued dependence, punishes the development of in-house evaluation capability, and gradually transfers the judgment function — the most valuable thing an engineering team possesses — to a vendor whose model of your system is forever incomplete.

The snake eats its tail. The tail grows back. The snake is always hungry.



Friday, 13 March 2026

The Tail at Scale


 When engineers talk about latency, they almost always talk about average latency. P50. The median experience. It's a comfortable metric — it responds to optimization, it's easy to visualize, and it makes dashboards look good. The trouble is that in any non-trivial distributed system, average latency is nearly irrelevant to whether your system actually feels fast.

A 2013 paper from Google, The Tail at Scale, reframed the entire conversation. The insight was simple: in a system where a single user request fans out to hundreds of backend machines, the response time is determined not by the average machine but by the slowest machine in the fan-out. The 99th percentile is not a corner case. It is, structurally, the common case for any sufficiently large request graph.

This is the founding observation behind Google's latency engineering philosophy — and it reshapes how you think about almost every architectural decision in a shared environment.

"In a large enough system, the tail is the average. The stragglers are not noise — they are the signal."

Why Shared Environments Are Inherently Hostile to Latency

A shared environment — whether it's a multi-tenant cluster, a distributed storage layer, or a cloud runtime shared across teams — introduces a class of latency that has nothing to do with your code. It comes from contention: for CPU time, for memory bandwidth, for network queues, for disk I/O. These are resources that your workload competes for with processes it has no visibility into and no control over.

The result is what the paper calls "variability amplification." Even a single machine exhibiting transient slowness — a GC pause, a cache eviction storm, a background compaction job — introduces latency that propagates through the system in ways that are entirely disproportionate to the duration of the original event. A 50ms hiccup on a single shard becomes a 200ms tail for every request that happened to touch that shard during that window.



This is the fundamental problem. And the conventional response — "profile and optimize the slow path" — doesn't work, because the slowness is not in the code. It's in the environment. You cannot optimize away a garbage collector running on a machine you don't control, or a noisy neighbor saturating the memory bus two NUMA nodes away.


Hedged Requests

The first and most important pattern the paper describes is the hedged request. The idea is counterintuitive enough that it's worth stating plainly: rather than waiting for a slow server to respond, you send the same request to a second server after a short delay — and take whichever response arrives first.

The delay is critical. You don't want to double your load by default. Instead, you observe your system's typical P95 response time and use that as the hedge threshold. If a request hasn't completed within that window, you issue an identical request to a different replica and race them. The moment either one responds, you cancel the other.



The practical effect is remarkable. Measurements at Google showed that hedging could reduce 99.9th percentile latency by an order of magnitude while increasing load by only a few percent — because most requests don't trigger the hedge at all, and those that do are precisely the ones stuck on slow replicas.

Hedged requests trade a small amount of extra load for a large reduction in tail latency. Load amplification is bounded by the fraction of requests that exceed your hedge threshold — which, by definition, is a small minority if you set the threshold near P95.

What makes this pattern powerful in shared environments specifically is that it sidesteps the cause of slowness entirely. You don't need to know why Replica 1 is slow. You don't need to detect it, alert on it, or drain it. You just race around it.
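The mechanism can be sketched in a few lines of Python, using threads to stand in for RPCs. The replica functions, delays, and hedge threshold here are illustrative stand-ins, not tuned values:

```python
import queue
import threading
import time

def hedged_request(primary, secondary, hedge_delay):
    """Issue `primary` immediately; if it has not responded within
    `hedge_delay`, race an identical request against `secondary` and
    return whichever response arrives first."""
    results = queue.Queue()
    done = threading.Event()  # best-effort cancellation for the loser

    def call(replica, name):
        value = replica()
        if not done.is_set():          # the loser's late response is discarded
            results.put((name, value))

    threading.Thread(target=call, args=(primary, "primary"), daemon=True).start()
    try:
        winner = results.get(timeout=hedge_delay)   # common case: no hedge sent
    except queue.Empty:
        threading.Thread(target=call, args=(secondary, "hedge"), daemon=True).start()
        winner = results.get()                      # race the two replicas
    done.set()
    return winner

# Simulated replicas: a straggler stuck behind a 500ms hiccup, and a healthy peer.
def slow_replica():
    time.sleep(0.5)
    return "ok"

def fast_replica():
    time.sleep(0.01)
    return "ok"
```

In production the hedge delay would be derived from a live P95 estimate rather than hard-coded, and cancellation would propagate to the backend rather than merely discarding the late response.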

Tied Requests and Cancellation

Hedged requests have a subtle problem: if both replicas are fast, you've wasted work on both. The tied request pattern refines this by introducing coordination between the two requests. When you issue a hedge, you attach a "cancellation token" that the replicas share. Whichever replica starts processing the request first notifies the other to cancel, and proceeds alone.

This is particularly valuable when requests are expensive to process — when the work itself consumes significant CPU or I/O on the backend. Instead of duplicating work silently, tied requests minimize wasted computation by communicating intent across the request boundary.

The implementation requires some infrastructure: replicas need to be aware of each other's state for a given request, which typically means either a shared coordination layer or an out-of-band cancellation channel. In Google's architecture, this was handled via internal RPC cancellation propagation. In most systems, you can approximate it with request-scoped context cancellation — the Go context.Context model being a modern analogue of this idea.
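The core of the coordination can be sketched with a shared token that the first replica to begin execution claims. The names and queue delays below are hypothetical; a real implementation would cancel via the RPC layer rather than an in-process lock:

```python
import threading
import time

class CancellationToken:
    """Shared between the two tied requests: the first replica to begin
    execution claims the token, and the other aborts before doing work."""
    def __init__(self):
        self._lock = threading.Lock()
        self._claimed_by = None

    def try_claim(self, name):
        with self._lock:
            if self._claimed_by is None:
                self._claimed_by = name
                return True
            return False            # peer started first: cancel

def tied_replica(name, queue_delay, token, results, work):
    time.sleep(queue_delay)         # time spent waiting in this replica's queue
    if not token.try_claim(name):
        return                      # cancelled: no duplicated computation
    results.append((name, work()))

token = CancellationToken()
results = []
a = threading.Thread(target=tied_replica,
                     args=("replica-a", 0.0, token, results, lambda: "done"))
b = threading.Thread(target=tied_replica,
                     args=("replica-b", 0.2, token, results, lambda: "done"))
a.start(); b.start(); a.join(); b.join()
```

The expensive `work()` runs exactly once, on whichever replica dequeued the request first; the other replica's cost is a single lock acquisition.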


Micro-Partitioning and Fine-Grained Load Balancing

Both hedging and tying are reactive: they respond to latency after it has occurred. A complementary proactive strategy is micro-partitioning — dividing work into far more partitions than you have machines, so that load imbalance between logical partitions can be corrected by reassigning partitions rather than migrating state.

The intuition is straightforward. If you have 100 machines and partition your keyspace into 100 shards, a hot key on one shard means one machine is overloaded and there's nowhere to move it without a full reshard. If instead you have 10,000 virtual partitions distributed across 100 machines, a hot partition can be migrated to a less-loaded machine in seconds, with minimal disruption.



This is less a trick and more a structural principle: partition granularity determines your ability to respond to imbalance. Google's Bigtable uses this extensively — tablet splits are designed to be cheap precisely so that hot tablets can be redistributed across tablet servers without downtime.
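A minimal sketch of the idea, assuming a hypothetical 100-machine cluster with 10,000 virtual partitions (the hashing and assignment scheme here is deliberately simplistic):

```python
import zlib

N_PARTITIONS = 10_000   # far more logical partitions than machines

def partition_of(key):
    # stable hash, so a key always maps to the same virtual partition
    return zlib.crc32(key.encode()) % N_PARTITIONS

class Cluster:
    def __init__(self, machines):
        # initial even spread of virtual partitions across machines
        self.assignment = {p: machines[p % len(machines)]
                           for p in range(N_PARTITIONS)}

    def machine_for(self, key):
        return self.assignment[partition_of(key)]

    def migrate(self, partition, target_machine):
        # correcting a hot spot means reassigning ONE small partition,
        # not resharding the whole keyspace
        self.assignment[partition] = target_machine

cluster = Cluster(machines=[f"m{i}" for i in range(100)])
hot = partition_of("viral-post-123")
cluster.migrate(hot, "m99")   # move the hot partition to a lightly loaded machine
```

The migration touches one map entry (plus, in a real system, one small partition's worth of state) — that cheapness is the whole point of the pattern.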

Good Citizens and Background Throttling

In any shared environment, there are two classes of work: foreground requests with latency SLOs that users directly feel, and background work — compaction, replication sync, index rebuilds, garbage collection — that has no user-visible deadline but consumes the same physical resources. The conflict between these two classes is one of the most consistent sources of latency spikes in production systems.

The solution is conceptually simple: background tasks must be "good citizens." They should yield CPU and I/O to foreground work when demand is detected. In practice, this means implementing throttle mechanisms that observe system load indicators — request queue depth, disk I/O wait, CPU steal — and automatically back off when those indicators cross a threshold.

Google's approach to this problem includes priority queues in their RPC layer, where foreground traffic can preempt background work mid-execution. Bigtable's compaction scheduler monitors foreground request rates and adjusts compaction aggressiveness in real time. The principle is that background jobs should "earn" their CPU time during slack periods, not consume it as a fixed entitlement.

What's important here is that this isn't optional in a shared environment — it's a contract. If your background jobs don't throttle, you are imposing your latency cost on every other workload sharing your infrastructure. In large organizations, this becomes a coordination problem: the team running the nightly reindex doesn't know which other team's latency SLO they're violating at 2am.
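The throttling contract reduces to a loop that consults a load signal before each unit of background work. The signal and threshold below are placeholders for real indicators like queue depth or I/O wait:

```python
def run_background_work(tasks, load_signal, threshold=5):
    """Run background tasks only while the foreground is quiet.

    `load_signal` is any callable returning a current contention indicator
    (request queue depth, disk I/O wait, ...). The worker yields, rather
    than competing, the moment the signal crosses the threshold."""
    completed = []
    for task in tasks:
        if load_signal() > threshold:
            break                 # back off; resume in the next slack period
        completed.append(task())
    return completed

# Simulated load: quiet for the first three checks, then a foreground burst.
depths = iter([0, 1, 2, 9, 9, 9])
done = run_background_work([lambda i=i: i for i in range(6)],
                           load_signal=lambda: next(depths))
```

The key property is that background work "earns" each unit of progress by re-checking the signal, instead of claiming a fixed entitlement up front.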

Selective Replication of Hot Data

The patterns above all treat slowness as something to route around or absorb. This final pattern takes a different approach: eliminate the bottleneck entirely for the data that matters most.

In most systems, data access follows a power-law distribution. A small number of keys — a viral post, a high-traffic configuration value, a globally shared counter — account for a disproportionate fraction of reads. These hot items are precisely the ones most likely to create queuing delays, cache evictions, and server-level saturation.

The solution is selective, on-the-fly replication of hot items. Rather than replicating everything uniformly, the system detects hot keys — through access frequency monitoring or explicit client hints — and creates additional in-memory replicas across multiple servers. Reads are then distributed across those replicas, reducing per-server load for the items that need it most.



This pattern is now standard in systems like Memcached (Facebook's lease mechanism was a direct response to this problem), Redis Cluster, and modern distributed caches. The underlying principle — don't treat all data as equally hot, and adapt replication depth to observed access patterns — generalizes far beyond caching.
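A toy version of hot-key detection and replication, with made-up thresholds and a simple round-robin read policy (a real cache would also age the counters and copy the actual data):

```python
import itertools
import zlib
from collections import Counter

class SelectiveReplicator:
    """Track per-key access frequency; once a key crosses the hot
    threshold, replicate it to extra servers and spread reads across
    all of its replicas."""
    def __init__(self, n_servers, hot_threshold=100, extra_replicas=3):
        self.n = n_servers
        self.hot_threshold = hot_threshold
        self.extra = extra_replicas
        self.counts = Counter()
        self._rr = itertools.count()   # round-robin cursor for hot reads

    def _home(self, key):
        return zlib.crc32(key.encode()) % self.n

    def replicas_for(self, key):
        home = self._home(key)
        if self.counts[key] < self.hot_threshold:
            return [home]                                # cold: single copy
        return [(home + i) % self.n for i in range(self.extra + 1)]

    def read(self, key):
        self.counts[key] += 1
        replicas = self.replicas_for(key)
        return replicas[next(self._rr) % len(replicas)]  # spread hot reads
```

Cold keys pay nothing; only the keys that demonstrably need extra capacity get it.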


Latency-Aware Load Balancing

Round-robin load balancing assumes that all backend servers are equivalent and equally available. In a shared environment, this assumption fails constantly. A server experiencing memory pressure or CPU saturation will accept requests at the same rate as a healthy one — queuing them invisibly while the client believes load is distributed evenly. The result is that round-robin actively routes traffic into latency holes it cannot see.

Latency-aware load balancing corrects this by making routing decisions based on observed response times rather than theoretical capacity. The client maintains a rolling measurement of each backend's recent latency and biases requests toward the faster ones. The simplest version is the "power of two choices" algorithm: rather than picking randomly from all backends, pick two at random and route to whichever has the lower current latency. The probabilistic gain is disproportionate to the cost — two random samples are enough to avoid the worst servers most of the time.



The elegance of this approach is that it requires no central coordinator and no global view of server health. Each client maintains its own local latency measurements independently. The collective effect of many clients doing this converges on a system-wide load distribution that naturally isolates slow servers — without any explicit health-checking infrastructure.

Google's gRPC and Envoy proxy both implement variants of this. Netflix's Ribbon client-side load balancer added latency-based weighting as a core feature after observing that round-robin was systematically directing traffic into degraded nodes during partial cluster failures.
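The "power of two choices" core fits in a few lines; an exponentially weighted moving average is one common way to keep the local latency estimate fresh (the latencies and smoothing factor below are illustrative):

```python
import random

def pick_backend(ewma_latency, rng=random):
    """Power of two choices: sample two backends at random and route
    to the one with the lower observed latency."""
    a, b = rng.sample(range(len(ewma_latency)), 2)
    return a if ewma_latency[a] <= ewma_latency[b] else b

def record_latency(ewma_latency, backend, observed_ms, alpha=0.3):
    """Each client updates only its own local view; no coordinator needed."""
    ewma_latency[backend] = ((1 - alpha) * ewma_latency[backend]
                             + alpha * observed_ms)

# Three healthy backends and one degraded straggler.
latencies = [10.0, 12.0, 11.0, 500.0]
rng = random.Random(42)
choices = [pick_backend(latencies, rng) for _ in range(1000)]
```

Because the straggler loses every pairwise comparison, it receives traffic only if both samples land on it — which, with two distinct samples, never happens here. That is the "disproportionate gain" from just two random probes.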

Request Deadline Propagation

Every distributed request has an implicit budget: the maximum time the user is willing to wait before the response becomes useless. A search result that arrives after the user has navigated away is not a slow success — it is a waste of resources that could have been spent on a fresher request. Yet most systems treat their internal RPC calls as if they exist outside of time, with no awareness of how much of the outer deadline has already been consumed.

Deadline propagation makes the remaining time budget explicit and transmits it across every service boundary. When a frontend handler receives a request with a 200ms SLO and spends 30ms doing authentication, the downstream RPC it issues should carry a deadline of 170ms — not an unconstrained call that could block for seconds. Each hop in the call graph receives a shrinking time window, and each service is expected to abandon work and return an error rather than continuing once that window closes.

Deadline propagation transforms timeout from a per-hop configuration into a system-wide invariant. Instead of each service having its own independently configured timeout — which can add up to far more than the user's actual patience — the deadline flows through the entire call graph as a shared, decrementing constraint.

Without deadline propagation, a slow backend continues burning CPU on a request whose answer will never be seen. The frontend has already returned an error to the user, but the downstream services don't know this — they keep working, consuming resources that could serve other requests. With deadline propagation, a cancelled frontend request immediately cancels the entire downstream tree. The work stops the moment it becomes irrelevant.



Go's context.Context is the most widely adopted implementation of this idea in modern systems. Passing a context with a deadline through every function call is the idiomatic Go way of expressing exactly this contract. The Dapper tracing system and gRPC's deadline mechanism implement the same principle at the RPC layer.
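The contract can be sketched in Python, loosely mirroring the Go context model; the service names, budgets, and timings here are hypothetical:

```python
import time

class Deadline:
    """An absolute expiry time passed across every service boundary,
    so each hop sees only the budget that remains."""
    def __init__(self, budget_s):
        self._expires = time.monotonic() + budget_s

    def remaining(self):
        return self._expires - time.monotonic()

def auth(deadline):
    time.sleep(0.03)                      # consumes ~30ms of the shared budget
    return "user-1"

def ranking(deadline, work_s):
    if work_s > deadline.remaining():
        # abandon instead of burning CPU on an answer nobody will see
        raise TimeoutError("insufficient budget; failing fast")
    time.sleep(work_s)
    return "results"

def handle_request():
    deadline = Deadline(0.2)              # the user-facing 200ms SLO
    auth(deadline)                        # every hop shares the SAME deadline
    return ranking(deadline, work_s=0.05)
```

Because the deadline is absolute rather than per-hop, the 30ms spent in auth automatically shrinks the window that ranking sees — no per-service timeout configuration required.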


Probabilistic Early Completion

There is a class of read-heavy workload where the question "what is the correct answer?" is less important than "what is a good enough answer, returned quickly?" Search ranking, recommendation feeds, autocomplete suggestions, approximate analytics — in each of these, the value of the response degrades gradually with quality, not catastrophically. A slightly stale recommendation list is almost as useful as a fresh one. A search result that includes 98% of relevant documents is indistinguishable from one with 100%, from the user's perspective.

Probabilistic early completion exploits this tolerance by allowing a request to return as soon as it has gathered "enough" signal, rather than waiting for every shard to respond. The coordinator tracks how many responses have arrived and, once a statistically sufficient fraction of shards have replied, returns the aggregated result rather than waiting for the stragglers. The remaining responses, when they eventually arrive, are discarded.

The fraction required is a tunable parameter that encodes the application's quality-vs-latency tradeoff. Setting it at 90% means the request finishes when 9 of 10 shards have responded — the one slow shard no longer determines the outcome. The quality loss is bounded by the fraction omitted, and in practice for approximate workloads the loss is negligible while the latency gain is substantial.

Probabilistic early completion only makes sense when partial results are semantically valid. It is appropriate for search, recommendations, aggregated metrics, and autocomplete. It is inappropriate for financial transactions, inventory updates, authentication checks, or any computation where partial data produces incorrect rather than merely approximate output.
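A sketch of quorum-based early return using a thread pool, with shard queries simulated by sleeps (shard count, delays, and the quorum fraction are illustrative):

```python
import concurrent.futures as cf
import time

def fan_out(shard_queries, quorum_fraction=0.9):
    """Return as soon as `quorum_fraction` of shards have replied;
    straggler responses are simply discarded."""
    need = max(1, int(len(shard_queries) * quorum_fraction))
    pool = cf.ThreadPoolExecutor(max_workers=len(shard_queries))
    futures = [pool.submit(q) for q in shard_queries]
    results = []
    for fut in cf.as_completed(futures):
        results.append(fut.result())
        if len(results) >= need:
            break                    # good enough: stop waiting
    pool.shutdown(wait=False)        # do not block on the stragglers
    return results

# Nine healthy shards and one straggler stuck behind a slow disk.
shards = [lambda: (time.sleep(0.01), "hit")[1] for _ in range(9)]
shards.append(lambda: (time.sleep(0.5), "hit")[1])
start = time.monotonic()
answers = fan_out(shards, quorum_fraction=0.9)
elapsed = time.monotonic() - start
```

The request completes in roughly the time of the ninth-fastest shard rather than the slowest one; the straggler's eventual answer is never consulted.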

Overload Admission Control

All of the patterns discussed so far are concerned with how an individual request navigates a slow or overloaded system. This one operates at a different level: preventing the system from accepting more work than it can complete within latency bounds in the first place.

The counterintuitive observation is that queueing is not a buffer — it is a latency amplifier. When a service is operating at capacity and accepts additional requests into a queue, those requests do not get served "slightly later." They get served much later, because every subsequent request must wait behind everything already in the queue. The 99th percentile latency of a system at 95% utilization can be ten times worse than the same system at 80% utilization, even though the throughput difference is modest.



Admission control accepts this reality and acts on it. Rather than allowing the queue to grow unbounded during traffic spikes, the system measures current utilization — active request count, queue depth, recent latency percentiles — and explicitly rejects incoming requests when those indicators cross a threshold. The rejected requests receive an immediate error rather than a delayed one. From the client's perspective, a fast rejection is often preferable to a timeout: it can retry against a different backend, fail fast, or serve from cache, rather than hanging indefinitely.

Google's internal systems use a technique called "client-side throttling" where the client itself participates in admission control: it tracks its own recent reject rate and probabilistically drops requests before sending them, reducing load on an already-stressed backend without requiring the backend to process and reject each request individually. Netflix's Concurrency Limits library implements a similar adaptive mechanism based on TCP congestion control algorithms — treating the request queue like a network pipe and backing off as soon as it detects queuing delay increasing.
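A minimal sketch of the admission gate using a non-blocking semaphore. The capacity bound here is an illustrative constant; adaptive systems like the Netflix library derive it from measured latency instead:

```python
import threading

class OverloadError(Exception):
    """Fast rejection: the caller can retry elsewhere or serve from cache."""

class AdmissionController:
    """Bound in-flight work and reject at the door instead of queuing,
    because the queue is a latency amplifier, not a buffer."""
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request_fn):
        if not self._slots.acquire(blocking=False):
            raise OverloadError("at capacity")   # immediate error, not a slow one
        try:
            return request_fn()
        finally:
            self._slots.release()

ctrl = AdmissionController(max_in_flight=2)
```

The rejected caller learns within microseconds that this backend cannot help, instead of discovering it via a timeout seconds later.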


Latency SLO Budgeting Across Teams

The patterns above are all technical. This last one is organizational, but its absence makes every technical pattern less effective.

In a large engineering organization, a single user-facing latency SLO — say, P99 < 300ms — is actually a composite of dozens of internal service SLOs. The frontend has a budget. The auth service has a budget. The ranking service has a budget. The storage layer has a budget. When these budgets are implicit, undocumented, or uncoordinated, teams make local decisions that are individually reasonable but collectively catastrophic. The auth team tightens its internal retry logic, adding 20ms to every call. The indexing team adds a synchronous cache warm-up step. Neither change violates any documented contract, and neither team knows what the other did. The cumulative effect is a P99 regression that shows up in the frontend SLO and takes weeks to attribute.

Latency budgeting makes these implicit contracts explicit. Each service in a call graph is assigned a latency budget — its maximum allowed contribution to the end-to-end P99 — derived from the top-level SLO. Changes that affect that budget require coordination across the services that share the call path. The budget is measured, reported, and treated as a first-class engineering constraint, like memory or CPU quota.

This is less a distributed systems pattern and more a systems-thinking pattern. Latency is a shared resource in the same way that bandwidth or storage is a shared resource. The only difference is that it is invisible until the moment it fails, at which point attribution is painful and slow. Making latency budgets explicit — even approximately — transforms latency from an emergent surprise into a managed constraint.

Embracing Stochastic Reality

What's important about these patterns, taken together, is what they have in common. None of them attempt to eliminate variability from the system. None of them assume that the environment can be made deterministic, or that every machine can be made equally fast, or that background noise can be suppressed. They all start from the premise that variability is irreducible — that shared environments will always produce straggler events — and design around it rather than against it.

"You cannot engineer your way to a deterministic distributed system. You can only engineer your way to one that degrades gracefully in the face of guaranteed non-determinism."

The old engineering intuition was that good infrastructure means predictable infrastructure — every component behaves the same way every time. The new intuition is that good infrastructure means resilient infrastructure — every request completes within acceptable bounds, regardless of what any individual component is doing.

Each pattern acknowledges that something will go wrong and designs so that something going wrong in one place cannot become everything going wrong everywhere.

The question worth sitting with, for any system you're currently building, is not "what is our average latency?" It is: "when something goes wrong on one machine, where does that pain go?" If the answer is "it propagates to every user touching that partition," you have a tail latency problem, and the patterns above are where the solution starts.

Further reading: The ideas in this post are drawn from The Tail at Scale — Communications of the ACM, February 2013. Worth reading in full if any of this resonated.