Like Playwright for your AI system. Write assertions, run in CI, and get human-readable reports. No ML overhead. No vendor lock-in.
```ts
import { evaluate, expect } from 'evaliphy';

// Response shape for /api/chat; `answer` and `context` are the fields
// the assertions below consume (context assumed to be retrieved passages).
interface ChatResponse {
  answer: string;
  context: string[];
}

evaluate("RAG /api/chat: return policy answer is faithful", async ({ httpClient }) => {
  const query = "What is your return policy?";
  const res = await httpClient.post("/api/chat", { message: query });
  const data = await res.json<ChatResponse>();

  // Retrieval-aware checks: judge the answer against the query and retrieved context.
  await expect(query, data.context, data.answer).toBeFaithful({ threshold: 0.8 });
  await expect(query, data.context, data.answer).toBeRelevant();
  await expect(query, data.context, data.answer).toBeGrounded();

  // Answer-only checks.
  await expect(data.answer).toBeHarmless();
  await expect(data.answer).toBeCoherent();
});
```
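Run this spec with the same command you would use in CI, `npx evaliphy run`; each `expect` call yields a pass/fail verdict in the report, alongside the LLM judge's reasoning.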
We built Evaliphy because AI testing should feel as straightforward as API testing: write assertions, run checks in CI, and get clear reports that drive immediate action.

Get detailed, human-readable reports with LLM-judge reasoning.


Test AI the same way you test APIs. Use assertions your team understands: `toBeFaithful()`, `toBeRelevant()`, and `toBeGrounded()` (sketched again for a second endpoint after these highlights).
Open source by default. Bring your own provider, own your test data, and run anywhere from local to CI.
No notebooks, no tuning pipelines, and no research stack. Just write assertions, run tests, and review results.
Understand failures quickly with plain-language reasoning and clear pass/fail outcomes you can act on in CI. Run with the standard `npx evaliphy run` command.
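The same assertion style carries over to any endpoint. Here is a minimal sketch that reuses only the API shown above; the `/api/summarize` endpoint, its request and response fields, and the `SummarizeResponse` type are hypothetical stand-ins for your own service.

```ts
import { evaluate, expect } from 'evaliphy';

// Hypothetical response shape for an assumed /api/summarize endpoint.
interface SummarizeResponse {
  summary: string;
  context: string[];
}

evaluate("Summarizer /api/summarize: summary stays grounded in the source", async ({ httpClient }) => {
  const query = "Summarize our shipping policy in two sentences.";
  const res = await httpClient.post("/api/summarize", { message: query });
  const data = await res.json<SummarizeResponse>();

  // Same judge-backed matchers as the chat example, with a stricter threshold.
  await expect(query, data.context, data.summary).toBeFaithful({ threshold: 0.9 });
  await expect(query, data.context, data.summary).toBeGrounded();
  await expect(data.summary).toBeCoherent();
});
```

Because the matchers are the same across endpoints, one `npx evaliphy run` in CI gates both tests with a single pass/fail report.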
Compare approaches, not just tools.
| Aspect | Evaliphy | Research Tools | Prompt Testing |
|---|---|---|---|
| Mental Model | Assertions like API tests | Research and optimization | Prompt iteration loops |
| Workflow | CI/CD pipeline | Notebook and experiments | CLI or web prompt runs |
| Setup Time | Minutes | Hours | Minutes |
| ML Knowledge Required | None | Significant | Minimal |
| Vendor Lock-In | None (open source) | Possible | Possible |
| Best For | Production AI testing in CI | Benchmarking and fine-tuning | Prompt engineering |
Use familiar expectations for AI quality, not new research paradigms.
No proprietary lock-in. Keep ownership of your tests, results, and workflows.
Move from manual checks to repeatable AI quality gates before release.
Start testing any AI system with simple assertions, CI integration, and reports your team can read instantly.