Evaliphy is currently in beta. It is not recommended for production use yet. Please try it out and share your feedback.
Open Source AI E2E Testing (Now in Beta)

Simplify End-to-End AI Testing

Like Playwright for your AI system. Write assertions, run in CI, and get human-readable reports. No ML overhead. No vendor lock-in.

return-policy.eval.ts
import { evaluate, expect } from 'evaliphy';

// Response shape assumed for this example; adjust to match your API.
interface ChatResponse {
  answer: string;
  context: string[];
}

evaluate("RAG /api/chat: return policy answer is faithful",
  async ({ httpClient }) => {
    // Call the system under test like any other HTTP endpoint.
    const query = "What is your return policy?";
    const res = await httpClient.post("/api/chat", { message: query });
    const data = await res.json<ChatResponse>();

    // Assert on answer quality with plain pass/fail matchers.
    await expect(query, data.context, data.answer).toBeFaithful({
      threshold: 0.8,
    });

    await expect(query, data.context, data.answer).toBeRelevant();
    await expect(query, data.context, data.answer).toBeGrounded();
    await expect(data.answer).toBeHarmless();
    await expect(data.answer).toBeCoherent();
});
Works with
OpenAI, Anthropic, OpenRouter, Mistral, and Vercel

Why Evaliphy Exists

We built Evaliphy because AI testing should feel as straightforward as API testing: write assertions, run checks in CI, and get clear reports that drive immediate action.

The Core Gap

  • ✅ Your teams already ship with assertion-based tests
  • ✅ CI/CD already enforces quality for every release
  • ❌ AI testing often drifts into notebook-heavy, research-first workflows
  • ❌ Results are frequently too metric-heavy for fast product decisions
How Evaliphy Works

Human-readable evaluation reports

Get detailed, human-readable reports with LLM-judge reasoning.

Screenshot: Evaliphy evaluation report

Four Reasons Teams Choose Evaliphy

Familiar Mental Model

Test AI the same way you test APIs. Use assertions your team understands: toBeFaithful(), toBeRelevant(), and toBeGrounded().
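
A minimal sketch of the same pattern on a non-RAG response, reusing only the evaluate, expect, and httpClient pieces from the example above (the request payload and response shape are assumptions for illustration):

safety.eval.ts
import { evaluate, expect } from 'evaliphy';

// Minimal sketch: one request, two pass/fail assertions on the answer.
// Matcher names come from the example at the top of this page; the
// response shape here is an assumption.
evaluate("Support /api/chat: cancellation reply is safe and coherent",
  async ({ httpClient }) => {
    const res = await httpClient.post("/api/chat", { message: "Cancel my order" });
    const { answer } = await res.json<{ answer: string }>();

    await expect(answer).toBeHarmless();
    await expect(answer).toBeCoherent();
});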

No Vendor Lock-In

Open source by default. Bring your own provider, own your test data, and run anywhere from local to CI.

No ML Overhead

No notebooks, no tuning pipelines, and no research stack. Just write assertions, run tests, and review results.

Human-Readable Reports

Understand failures quickly with plain-language reasoning and clear pass/fail outcomes you can act on in CI. Run with the standard npx evaliphy run command.

Quick Comparison

Compare approaches, not just tools.

Aspect                | Evaliphy                     | Research Tools               | Prompt Testing
Mental Model          | Assertions like API tests    | Research and optimization    | Prompt iteration loops
Workflow              | CI/CD pipeline               | Notebook and experiments     | CLI or web prompt runs
Setup Time            | Minutes                      | Hours                        | Minutes
ML Knowledge Required | None                         | Significant                  | Minimal
Vendor Lock-In        | None (open source)           | Possible                     | Possible
Best For              | Production AI testing in CI  | Benchmarking and fine-tuning | Prompt engineering

AI Testing Like Everything Else

Same Assertion Mindset

Use familiar expectations for AI quality, not new research paradigms; the short sketch after this list shows the parallel.

  • Playwright tests UI flows
  • Evaliphy tests AI responses
  • Both use clear pass/fail assertions
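
A rough side-by-side of the two assertion styles (the Playwright line uses its standard expect API from @playwright/test; the Evaliphy line reuses the matcher and variables from the example at the top of this page, so the two expect calls come from different libraries):

// Playwright: assert on UI state
await expect(page.getByRole('heading', { name: 'Return Policy' })).toBeVisible();

// Evaliphy: assert on AI response quality, with the same pass/fail mindset
await expect(query, data.context, data.answer).toBeFaithful({ threshold: 0.8 });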

Open by Design

No proprietary lock-in. Keep ownership of your tests, results, and workflows.

  • Open source framework
  • Works with major LLM providers
  • Run locally or inside CI/CD

Built for Shipping Teams

Move from manual checks to repeatable AI quality gates before release.

  • Catch regressions before users do
  • Share human-readable reports across teams
  • Keep AI testing in your normal workflow

Ready to simplify AI testing?

Start testing any AI system with simple assertions, CI integration, and reports your team can read instantly.

$ npm install -g evaliphy