A/B test and audit your AI custom instructions. Measure whether your instruction changes actually improve outputs, with data instead of vibes.
instreval runs your prompts through an LLM with different instruction sets and scores the outputs on dimensions you define. It answers the question: "I changed my custom instructions -- did my outputs get better or worse?"
Three modes:
- compare -- A/B test two instruction sets against the same prompts and scoring criteria
- audit -- Evaluate a single instruction set; auto-generates test prompts and scoring dimensions
- suggest -- Generate specific instruction improvements based on eval results
```bash
pip install instreval
```

Or for development:

```bash
git clone https://github.com/dvelton/instreval.git
cd instreval
pip install -e ".[dev]"
```
Point it at your custom instructions file. It generates test prompts, runs them, scores the output, and suggests improvements.
```bash
export ANTHROPIC_API_KEY=sk-...  # or OPENAI_API_KEY, etc.
instreval audit my-instructions.txt --model claude-sonnet-4-20250514
```
Create an eval config (YAML) with two instruction sets, test prompts, and scoring criteria:
```yaml
model: "claude-sonnet-4-20250514"

instructions:
  baseline: baseline.txt
  candidate: candidate.txt

prompts:
  - id: memo-draft
    text: "Draft a memo summarizing the risks of adopting third-party AI models."
  - id: email-response
    text: "Write a response to a customer asking about our data retention policy."

scoring:
  dimensions:
    - name: conciseness
      type: llm_judge
      criteria: "Rate how concise the response is. Penalize filler and repetition."
    - name: ai_patterns
      type: rule_based
      rules:
        - pattern: " — "
          name: em_dash_overuse
          max_allowed: 2
        - pattern: "(?i)it's not .+—.?it's"
          name: negative_parallelism
          max_allowed: 0

runs: 3
```

Then run:
```bash
instreval compare eval.yaml --save results.json
```
Generate improvement suggestions from saved results:
```bash
instreval suggest results.json my-instructions.txt --model claude-sonnet-4-20250514
```
- `rule_based` -- Pattern matching with regex. Runs locally, zero API cost. Good for catching specific text patterns (em dash overuse, rhetorical structures, formatting violations).
- `llm_judge` -- Uses an LLM to score outputs on subjective criteria you define. Costs API calls. Good for harder-to-quantify dimensions like conciseness, tone, actionability, naturalness.
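As an illustration, a rule-based dimension boils down to counting regex matches in an output and comparing the count to a cap. The sketch below is hypothetical (the `check_rule` helper is not instreval's actual API); it reuses the two patterns from the example config:

```python
import re

# Hypothetical sketch of a rule-based check: pass only if the number of
# regex matches in the output stays within max_allowed.
def check_rule(output: str, pattern: str, max_allowed: int) -> bool:
    return len(re.findall(pattern, output)) <= max_allowed

text = "It's not magic — it's statistics. And — to be clear — it works."
print(check_rule(text, r" — ", 2))                     # em_dash_overuse: 3 matches -> False
print(check_rule(text, r"(?i)it's not .+—.?it's", 0))  # negative_parallelism: 1 match -> False
```

Because checks like this are plain regex matching, they run locally with zero API cost.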
instreval uses LiteLLM under the hood, so it works with any provider:
- OpenAI: `gpt-4o`, `gpt-4o-mini`, etc.
- Anthropic: `claude-sonnet-4-20250514`, `claude-haiku-4-20250514`, etc.
- Google: `gemini/gemini-pro`, etc.
- Azure, AWS Bedrock, Ollama, and 100+ more
Set the appropriate API key as an environment variable and specify the model string.
- Loads your instruction sets and eval config
- Runs each test prompt against each instruction set N times (configurable, default 3)
- Scores each output on every dimension (rule-based checks run locally; LLM-judge calls the model)
- Produces a comparison report with per-dimension and per-prompt breakdowns
- Diagnoses weaknesses with specific evidence
- Optionally generates improvement suggestions with rationale
Multiple runs per prompt account for output variance -- AI responses aren't deterministic, so a single sample isn't reliable.
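For instance, aggregating a judge score over three runs (a made-up illustration using Python's `statistics` module, not instreval's reporting code) shows why one sample is unreliable:

```python
from statistics import mean, stdev

# Made-up conciseness scores from 3 runs of the same prompt.
# Any single run could be an outlier; the mean plus spread is more stable.
conciseness_scores = [0.9, 0.6, 0.8]

avg = mean(conciseness_scores)
spread = stdev(conciseness_scores)
print(f"mean={avg:.2f} stdev={spread:.2f}")  # prints mean=0.77 stdev=0.15
```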
See `examples/writing-style/` for a complete example that tests whether adding anti-AI-slop writing rules to instructions actually reduces AI writing patterns in outputs.
PRs welcome. Areas where contributions would be particularly useful:
- New scorer types (readability metrics, sentiment analysis, factual consistency)
- Preset scoring dimensions for common use cases
- Better report formats (markdown export, HTML)
- Preset test prompt libraries for common instruction categories
MIT