LLM-backed support automation agent with EvalView regression guardrails.
This starter is built around a problem most agent teams actually have: the endpoint still returns 200, the answer still sounds plausible, but the agent has silently changed its tool use or skipped a key support workflow.
The agent handles three realistic support scenarios:
- refund requests
- billing disputes
- VIP outage triage
Most support-agent demos prove that an agent can answer a ticket. This repo proves something more useful:
can you catch the bad deploy before it hits customers?
That is why the repo ships with:
- committed EvalView tests
- committed baselines
- a clean regression mode
- a real LLM-backed path for teams that want credibility
No API key:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install evalview
make run
```

Keep that terminal running. In a second terminal:

```bash
source .venv/bin/activate
make check
```

Real LLM-backed path:
```bash
cp .env.example .env
export OPENAI_API_KEY=your-key
make llm
```

Keep that terminal running. In a second terminal:

```bash
source .venv/bin/activate
evalview snapshot tests
evalview check tests
```

In baseline mode the agent uses tools the way you would expect a real support agent to behave. In regression mode it makes two subtle but costly mistakes:
- it escalates every refund after already issuing the refund
- it skips billing-history lookup and gives a vague follow-up answer
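
As a conceptual sketch, the two regressions look like this. Every name below except the tool names `escalate_to_human` and `check_billing_history` is a hypothetical stand-in, not the repo's actual code:

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    id: str
    order_id: str

# Hypothetical stand-ins for the agent's real tools in agent/app.py.
def issue_refund(order_id: str) -> str:
    return f"refund issued for {order_id}"

def escalate_to_human(ticket_id: str, reason: str) -> None:
    print(f"escalated {ticket_id}: {reason}")

def handle_refund(ticket: Ticket) -> str:
    result = issue_refund(ticket.order_id)
    # Regression 1: escalate even though the refund already succeeded.
    escalate_to_human(ticket.id, reason="refund follow-up")
    return result

def handle_billing_dispute(ticket: Ticket) -> str:
    # Regression 2: the check_billing_history lookup is skipped, so the
    # reply below is vague instead of grounded in account data.
    return "Thanks for flagging this; we'll review your account and follow up."
```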
What's inside:
- FastAPI HTTP agent with a real tool-using LLM path
- editable support playbook in `agent/support_playbook.md`
- deterministic fallback mode so the repo still runs without API keys
- EvalView regression tests for refund, billing, and outage flows
- committed golden baselines for the fallback mode
- GitHub Actions regression check
- one-command baseline vs regression demo
The repo supports two backends:

`mock`
- deterministic fallback
- no API key needed
- used for the committed goldens and CI by default

`openai`
- real LLM-backed support agent
- requires `OPENAI_API_KEY`
- uses tool calling to choose actions and produce the final response
Backend selection:
- default: `auto`
- if `OPENAI_API_KEY` is set, the app uses the OpenAI backend
- otherwise it falls back to the deterministic backend
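
Conceptually, the auto-selection amounts to something like this sketch. The class names are hypothetical stand-ins; the real logic lives in `agent/app.py` and also honors the forced `AGENT_BACKEND` values shown below:

```python
import os

# Hypothetical stand-ins for the real backends in agent/app.py.
class MockBackend:
    """Deterministic fallback; no API key needed."""

class OpenAIBackend:
    """Real LLM-backed agent; requires OPENAI_API_KEY."""

def pick_backend():
    """Resolve AGENT_BACKEND, defaulting to auto-detection."""
    choice = os.environ.get("AGENT_BACKEND", "auto")
    if choice == "mock":
        return MockBackend()
    if choice == "openai":
        return OpenAIBackend()
    # auto: use the LLM backend only when a key is present.
    if os.environ.get("OPENAI_API_KEY"):
        return OpenAIBackend()
    return MockBackend()
```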
You can also force it:

```bash
AGENT_BACKEND=mock
AGENT_BACKEND=openai
```

To set everything up and run in mock mode:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install evalview
make run
```

Keep that terminal running. In a second terminal:

```bash
source .venv/bin/activate
make check
```

That should pass against the committed baseline in mock mode.
To switch to the LLM backend:

```bash
export OPENAI_API_KEY=your-key
make llm
```

This starts the same support agent, but now the model decides:
- when to ask for missing information
- which tools to call
- what to say after the tool results come back
- how to follow your support playbook in `agent/support_playbook.md`
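
As a rough sketch of that decision loop: the tool schema, model name, and `dispatch_tool` helper below are illustrative assumptions rather than the repo's exact code, but the flow is the standard OpenAI tool-calling pattern:

```python
import json
from openai import OpenAI

client = OpenAI()

# Illustrative tool schema; the real schemas live in agent/app.py.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "check_billing_history",
        "description": "Look up a customer's recent billing events.",
        "parameters": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
        },
    },
}]

def dispatch_tool(name: str, arguments: str) -> str:
    """Hypothetical dispatcher: route a tool call to its implementation."""
    args = json.loads(arguments)
    return json.dumps({"tool": name, "args": args, "ok": True})

def run_turn(messages: list) -> str:
    """Loop until the model stops calling tools and answers the customer."""
    while True:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=messages,
            tools=TOOLS,
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # final answer for the customer
        messages.append(msg)
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": dispatch_tool(call.function.name, call.function.arguments),
            })
```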
If you want to use EvalView against that LLM-backed behavior, run:

```bash
evalview snapshot tests
evalview check tests
```

That will create baselines for your local model-backed version of the agent.
The fastest way to adapt this repo is to edit:
- `agent/support_playbook.md`: policy and escalation rules
- `tests/*.yaml`: the user flows you care about
- `agent/app.py`: tool implementations and business data
This keeps the example useful for real teams: you can swap the support policy and keep the same EvalView regression loop.
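
For example, a tool implementation and its backing business data can stay this small. The records below are hypothetical; match whatever `agent/app.py` actually defines:

```python
# Hypothetical business data, in the spirit of agent/app.py;
# swap in your own records and policy.
BILLING_HISTORY = {
    "cust_123": [
        {"date": "2024-05-01", "amount": 49.00, "note": "charged twice"},
        {"date": "2024-04-01", "amount": 49.00, "note": "ok"},
    ],
}

def check_billing_history(customer_id: str) -> list[dict]:
    """Tool: return recent billing events for a customer."""
    return BILLING_HISTORY.get(customer_id, [])
```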
To run the baseline vs regression demo, mock backend:

```bash
make regress
evalview check tests
```

LLM backend:

```bash
export OPENAI_API_KEY=your-key
make llm-regress
evalview check tests
```

You should see:
- `TOOLS_CHANGED` on the refund flow, because `escalate_to_human` was added
- `REGRESSION` on the billing dispute flow, because the agent stops calling `check_billing_history`
What the tests cover:
- refund flow: unnecessary human escalation after a successful refund
- billing dispute: missing billing-history lookup before answering
- VIP outage: correct verification and escalation path for high-priority support
These are the kinds of regressions that still sound plausible in manual spot checks but should absolutely block deploys.
This is a support agent people can actually relate to:
- it asks for missing information before acting
- it checks history before answering disputes
- it escalates high-priority incidents correctly
- it shows how regressions can be operational, not just textual
That makes it a good fit for teams evaluating whether EvalView can guard real production workflows.
Repo layout:
- `agent/app.py`: FastAPI support automation agent with OpenAI tool-calling backend + mock fallback
- `agent/support_playbook.md`: editable support policy the LLM follows
- `tests/`: EvalView regression tests
- `.evalview/config.yaml`: local adapter config
- `.evalview/golden/`: committed mock-mode baselines
- `.github/workflows/evalview.yml`: CI workflow
- `Makefile`: simple local commands
If you intentionally change the agent and want to keep the new behavior:
```bash
evalview snapshot tests
```

Then commit the updated `.evalview/golden/` files.