A/B test and audit your AI custom instructions. Measure whether your instruction changes actually improve outputs, with data instead of vibes.
instreval runs your prompts through an LLM with different instruction sets and scores the outputs on dimensions you define. It answers the question: "I changed my custom instructions -- did my outputs get better or worse?"
Three modes:
- compare -- A/B test two instruction sets against the same prompts and scoring criteria
- audit -- Evaluate a single instruction set; auto-generates test prompts and scoring dimensions
- suggest -- Generate specific instruction improvements based on eval results
```bash
pip install instreval
```

Or for development:

```bash
git clone https://github.com/dvelton/instreval.git
cd instreval
pip install -e ".[dev]"
```
Point it at your custom instructions file. It generates test prompts, runs them, scores the output, and suggests improvements.
```bash
export ANTHROPIC_API_KEY=sk-...  # or OPENAI_API_KEY, etc.
instreval audit my-instructions.txt --model claude-sonnet-4-20250514
```
Create an eval config (YAML) with two instruction sets, test prompts, and scoring criteria:
```yaml
model: "claude-sonnet-4-20250514"

instructions:
  baseline: baseline.txt
  candidate: candidate.txt

prompts:
  - id: memo-draft
    text: "Draft a memo summarizing the risks of adopting third-party AI models."
  - id: email-response
    text: "Write a response to a customer asking about our data retention policy."

scoring:
  dimensions:
    - name: conciseness
      type: llm_judge
      criteria: "Rate how concise the response is. Penalize filler and repetition."
    - name: ai_patterns
      type: rule_based
      rules:
        - pattern: " — "
          name: em_dash_overuse
          max_allowed: 2
        - pattern: "(?i)it's not .+—.?it's"
          name: negative_parallelism
          max_allowed: 0

runs: 3
```

Then run:
```bash
instreval compare eval.yaml --save results.json
```
Generate improvement suggestions from saved results:
```bash
instreval suggest results.json my-instructions.txt --model claude-sonnet-4-20250514
```
- `rule_based` -- Pattern matching with regex. Runs locally, zero API cost. Good for catching specific text patterns (em dash overuse, rhetorical structures, formatting violations).
- `llm_judge` -- Uses an LLM to score outputs on subjective criteria you define. Costs API calls. Good for harder-to-quantify dimensions like conciseness, tone, actionability, naturalness.
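As an illustration, a rule-based dimension boils down to counting regex matches in an output and comparing the count to a cap. The sketch below is hypothetical (the `check_rule` helper is not instreval's actual API); it reuses the two patterns from the example config:

```python
import re

# Hypothetical sketch of a rule-based check: pass only if the number of
# regex matches in the output stays within max_allowed.
def check_rule(output: str, pattern: str, max_allowed: int) -> bool:
    return len(re.findall(pattern, output)) <= max_allowed

text = "It's not magic — it's statistics. And — to be clear — it works."
print(check_rule(text, r" — ", 2))                     # em_dash_overuse: 3 matches -> False
print(check_rule(text, r"(?i)it's not .+—.?it's", 0))  # negative_parallelism: 1 match -> False
```

Because checks like this are plain regex matching, they run locally with zero API cost.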
instreval uses LiteLLM under the hood, so it works with any provider:
- OpenAI: `gpt-4o`, `gpt-4o-mini`, etc.
- Anthropic: `claude-sonnet-4-20250514`, `claude-haiku-4-20250514`, etc.
- Google: `gemini/gemini-pro`, etc.
- Azure, AWS Bedrock, Ollama, and 100+ more
Set the appropriate API key as an environment variable and specify the model string.
- Loads your instruction sets and eval config
- Runs each test prompt against each instruction set N times (configurable, default 3)
- Scores each output on every dimension (rule-based checks run locally; LLM-judge calls the model)
- Produces a comparison report with per-dimension and per-prompt breakdowns
- Diagnoses weaknesses with specific evidence
- Optionally generates improvement suggestions with rationale
Multiple runs per prompt account for output variance -- AI responses aren't deterministic, so a single sample isn't reliable.
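For instance, aggregating a judge score over three runs (a made-up illustration using Python's `statistics` module, not instreval's reporting code) shows why one sample is unreliable:

```python
from statistics import mean, stdev

# Made-up conciseness scores from 3 runs of the same prompt.
# Any single run could be an outlier; the mean plus spread is more stable.
conciseness_scores = [0.9, 0.6, 0.8]

avg = mean(conciseness_scores)
spread = stdev(conciseness_scores)
print(f"mean={avg:.2f} stdev={spread:.2f}")  # prints mean=0.77 stdev=0.15
```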
See `examples/writing-style/` for a complete example that tests whether adding anti-AI-slop writing rules to instructions actually reduces AI writing patterns in outputs.
PRs welcome. Areas where contributions would be particularly useful:
- New scorer types (readability metrics, sentiment analysis, factual consistency)
- Preset scoring dimensions for common use cases
- Better report formats (markdown export, HTML)
- Preset test prompt libraries for common instruction categories
MIT