Eval-Driven Development
Our focus as technologists must shift from what we can build to what we can prove.
Software development is now agent-driven. AI writes the code. The engineer's job is no longer to produce working software — it is to define what "working" means, measure it, and hold the system to that definition.
We propose Eval-Driven Development: a discipline where every probabilistic system starts with a specification of correctness, and nothing ships without automated proof that it meets that spec.
Definitions
An eval is three things: a dataset of inputs, a grader that scores outputs, and a harness that runs the system over the dataset and applies the grader. The grader and harness are built before you write code. The dataset evolves from synthetic to production-representative. The commitment to measure exists from the start, even if the full specifics cannot be known at project inception.
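A minimal sketch of those three parts in Python — the case shape, the exact-match grader, and the harness signature are all illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str
    expected: str  # for richer tasks, expected properties instead of a literal

def exact_match_grader(output: str, case: EvalCase) -> float:
    # Deterministic grader: 1.0 on a match, 0.0 otherwise.
    return 1.0 if output.strip() == case.expected.strip() else 0.0

def run_eval(system: Callable[[str], str],
             dataset: list[EvalCase],
             grader: Callable[[str, EvalCase], float]) -> float:
    # Harness: run the system over every case and average the grades.
    scores = [grader(system(case.input), case) for case in dataset]
    return sum(scores) / len(scores)

# Usage: a trivial stand-in "system" and a two-case synthetic dataset.
dataset = [EvalCase("2+2", "4"), EvalCase("3+3", "6")]
score = run_eval(lambda q: str(sum(int(x) for x in q.split("+"))),
                 dataset, exact_match_grader)
print(score)  # 1.0
```

The point of the shape: the system under test is just a function passed into the harness, so it can be swapped for a prompt, a pipeline, or a mock without touching the grader.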
Principles
1. Evaluation is the product
The eval suite is not a phase that follows development. Build evals first. Code is generated. Evals are engineered.
2. Define correctness before you write a prompt
If you cannot express "correct" as a deterministic function, you are not ready to build. Every task needs an eval. Every eval needs a threshold. Every threshold needs a justification.
3. Probabilistic systems require statistical proof
A single passing test proves nothing about a stochastic system. You need sample sizes, confidence intervals, and regression baselines. Measure distributions, not anecdotes.
4. Evals must run in CI
If your evals do not run on every change, they do not exist. Evaluation belongs in the pipeline next to lint, type-check, and build — not in a notebook someone runs quarterly.
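In practice this can be a gate script the pipeline runs like any other check; the threshold value and the `run_smoke_eval` placeholder below are hypothetical:

```python
import sys

# Hypothetical gate threshold; in practice it lives in version control
# next to the eval definitions, with its justification.
THRESHOLD = 0.90

def run_smoke_eval() -> float:
    # Placeholder: a real pipeline would execute the fast smoke suite
    # here and return its aggregate score.
    return 0.93

score = run_smoke_eval()
if score < THRESHOLD:
    print(f"eval gate FAILED: {score:.2f} < {THRESHOLD:.2f}")
    sys.exit(1)  # nonzero exit fails the CI job, same as a lint error
print(f"eval gate passed: {score:.2f} >= {THRESHOLD:.2f}")
```

Because it exits nonzero on failure, any CI system treats it exactly like a failing build step — no dashboard-watching required.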
5. Evaluation drives architecture
The eval suite determines the system boundary. If a component cannot be independently evaluated, it cannot be independently trusted. Design for measurability like you design for testability.
6. Cost is a metric
Token spend, latency, and compute are evaluation dimensions. A system that is correct but unaffordable has failed its eval.
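A sketch of cost as a first-class eval dimension — the metric fields, thresholds, and run data are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    correct: bool
    tokens: int
    latency_ms: float

# Hypothetical results from one eval run.
runs = [
    RunMetrics(True, 1200, 450.0),
    RunMetrics(True, 9800, 2100.0),
    RunMetrics(False, 1500, 500.0),
]

accuracy = sum(r.correct for r in runs) / len(runs)
mean_tokens = sum(r.tokens for r in runs) / len(runs)
worst_latency = max(r.latency_ms for r in runs)

# The system passes only if every dimension clears its threshold:
# correctness, spend, and latency are graded together.
passed = accuracy >= 0.66 and mean_tokens <= 5000 and worst_latency <= 3000
print(passed)
```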
7. Human judgment does not scale — codify it
Every manual review is a missing eval. When a human judges output quality, extract that judgment into a rubric, automate the rubric, then evaluate the evaluator.
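One possible shape for a codified rubric — the check names, predicates, and weights below are illustrative stand-ins for whatever a human reviewer was actually judging:

```python
# A manual judgment ("this summary is good") decomposed into weighted,
# deterministic checks that can run on every output.
RUBRIC = [
    ("mentions_price", lambda out: "$" in out, 0.4),
    ("under_50_words", lambda out: len(out.split()) <= 50, 0.3),
    ("no_hedging",     lambda out: "might" not in out.lower(), 0.3),
]

def grade(output: str) -> float:
    # Sum the weights of the checks the output satisfies.
    return sum(weight for _, check, weight in RUBRIC if check(output))

score = grade("The widget costs $19 and ships today.")
print(score)
```

Once the rubric is code, "evaluate the evaluator" becomes tractable: run the rubric against outputs a human has already labeled and measure agreement.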
8. Ship the eval, not the demo
A demo proves something can work once. An eval proves it works reliably under distribution shift. Demos convince stakeholders. Evals convince engineers.
9. Version your evals like you version your code
Eval definitions, datasets, thresholds, and results live in version control. They have changelogs. When the eval changes, the reason is documented.
10. The eval gap is the opportunity
Most teams ship AI without rigorous evaluation. The gap between "it works on my machine" and "it passes eval at p < 0.05" is where defensible products get built.
FAQ
How is eval-driven development different from test-driven development?
TDD uses binary pass/fail criteria that work for deterministic code. AI systems are probabilistic—outputs vary across runs, models, and prompts. Eval-driven development requires defining success thresholds before writing tests: what score is good enough? What regression is acceptable? That threshold-setting step is the part that experienced test-driven developers do intuitively but rarely formalize. EDD makes it explicit and mandatory.
Isn't this just MLOps?
MLOps treats evaluation as a deployment and monitoring concern. Eval-driven development makes it a development practice that precedes writing code, not something bolted on after. The eval comes first—before the prompt, before the pipeline, before the model selection. MLOps asks "is it still working?" EDD asks "how do we know it works at all?"
You can't evaluate subjective AI outputs.
You can. Define rubrics, use LLM-as-judge, measure consistency across runs. "Subjective" usually means "we haven't defined our criteria yet"—which is exactly the problem eval-driven development solves. If you can't articulate what good looks like, you can't build toward it.
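Consistency across runs is itself measurable; a crude sketch, with hypothetical judge verdicts:

```python
from collections import Counter

# The same input graded five times by an LLM judge (verdicts invented).
# The modal verdict's frequency is a rough self-agreement score.
verdicts = ["pass", "pass", "fail", "pass", "pass"]
agreement = Counter(verdicts).most_common(1)[0][1] / len(verdicts)
print(agreement)  # 0.8
```

A judge that disagrees with itself on identical input is not yet a grader; tighten the rubric until agreement is high before trusting its scores.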
Evals are too slow and expensive to run in CI.
Tier them. Fast, cheap smoke evals on every commit. Comprehensive suites nightly. Same pattern as unit tests versus integration tests. The cost of not running evals is shipping regressions to users—that's more expensive.
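The tiering can be as simple as an environment switch; the case names and the `EVAL_TIER` variable below are illustrative:

```python
import os

# Cheap smoke cases on every commit; the full suite only when the
# nightly job sets EVAL_TIER=full.
SMOKE = ["greeting_basic", "refusal_basic"]
FULL = SMOKE + ["long_context", "adversarial", "multilingual"]

tier = os.environ.get("EVAL_TIER", "smoke")
cases = FULL if tier == "full" else SMOKE
print(f"running {len(cases)} cases for tier '{tier}'")
```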
My use case is just one API call. I don't need this.
That call will regress when the model updates, the prompt drifts, or the context changes. The simpler the integration, the easier the eval—no excuse not to have one.
How is this different from A/B testing?
A/B testing experiments on users post-deploy. Evals catch problems pre-deploy, repeatably, without shipping broken experiences to real people. A/B testing tells you which version users prefer. Evals tell you whether either version is good enough to ship.