Inspiration
Every LLM-powered application shares the same silent killer — hallucinations. When a model confidently states something factually wrong, users lose trust instantly. But most teams only discover this after the damage is done, through user complaints or manual review. There was no lightweight, real-time way to catch hallucinations as they happen, track when a model starts degrading, and get actionable recommendations to fix it. We built Hallucination Watchdog to solve exactly that.
What it does
Hallucination Watchdog is a real-time LLM quality monitoring agent. You send it any (query, context, response) triple from your LLM pipeline, and it instantly scores the response for factual grounding. It returns a verdict — HALLUCINATION or FACTUAL — along with a confidence score and a plain English explanation of why. Beyond single evaluations, it tracks every result over time and detects drift — identifying when your hallucination rate is trending up before users notice. It then generates concrete recommendations: improve your retrieval context, harden your prompt, or switch to a more grounded model variant. The system exposes four endpoints — /evaluate, /drift, /worst-offenders, and /agent — where the agent endpoint accepts natural language and lets Gemini decide which tools to call. Everything is traced to Arize Phoenix Cloud for full observability.
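The drift check described above can be pictured as a comparison between two windows of stored verdicts. The following is only a minimal sketch under stated assumptions: the `EvalRecord` type, the `detect_drift` function, the window size, and the threshold are all hypothetical names and values chosen for illustration, not the project's actual implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalRecord:
    """One stored evaluation (hypothetical shape)."""
    verdict: str  # "HALLUCINATION" or "FACTUAL"
    score: float  # confidence score returned with the verdict


def hallucination_rate(records: List[EvalRecord]) -> float:
    """Fraction of records labeled HALLUCINATION (0.0 for an empty list)."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r.verdict == "HALLUCINATION")
    return flagged / len(records)


def detect_drift(history: List[EvalRecord], window: int = 20,
                 threshold: float = 0.10) -> dict:
    """Compare the most recent window against the window just before it.

    Drift is reported when the hallucination rate rose by more than
    `threshold`. Both defaults are assumed values for this sketch.
    """
    if len(history) < 2 * window:
        return {"drifting": False, "reason": "not enough data"}
    baseline = hallucination_rate(history[-2 * window:-window])
    recent = hallucination_rate(history[-window:])
    delta = recent - baseline
    return {
        "drifting": delta > threshold,
        "baseline_rate": baseline,
        "recent_rate": recent,
        "delta": delta,
    }


# Usage: 20 clean evaluations, then 20 where every other one is flagged.
history = [EvalRecord("FACTUAL", 0.9) for _ in range(20)]
for _ in range(10):
    history.append(EvalRecord("HALLUCINATION", 0.8))
    history.append(EvalRecord("FACTUAL", 0.9))

report = detect_drift(history)
```

A fixed two-window comparison is the simplest option; a rolling average or a statistical test would be less sensitive to small samples.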
How we built it
We built Hallucination Watchdog on four core technologies:
- Google ADK + Gemini 2.5 Flash (agent layer). The agent is built with Google's Agent Development Kit and powered by Gemini 2.5 Flash. It has three tools (evaluate a response, check drift, and surface worst offenders) and uses the Runner pattern to execute multi-step reasoning in a single natural language call.
- Arize Phoenix FaithfulnessEvaluator (evaluation engine). It takes a query, reference context, and LLM response, and uses Gemini as a judge LLM to determine whether the response is faithful to the context. Every evaluation runs in a thread executor so it doesn't block the async server.
- OpenTelemetry + Arize Phoenix Cloud (observability). Every evaluation is recorded as an OTel span with verdict, score, latency, and span ID as attributes, then exported to Phoenix Cloud, where it becomes a queryable trace.
- FastAPI on Google Cloud Run (deployment layer). The server exposes a clean REST API and is deployed on Cloud Run with always-warm instances to avoid cold start latency during demos.
Challenges we ran into
The biggest challenge was navigating the rapidly evolving Phoenix evals API. The HallucinationEvaluator we originally built around was deprecated mid-build in favor of FaithfulnessEvaluator, which had a completely different method signature: it accepts a dict via eval_input rather than keyword arguments. Debugging this required inspecting the evaluator's input schema directly at runtime.
We also hit significant friction with university organization policies on Google Cloud, which blocked Secret Manager access and restricted public IAM bindings on Cloud Run. We worked around this by passing secrets as environment variables and using identity token authentication for the deployed service.
Windows-specific issues added complexity. Long path errors during pip install, PowerShell syntax differences for multi-line commands, and environment variables not persisting across terminal sessions all required workarounds that wouldn't exist on Linux or macOS.
Finally, reconciling two different authentication models (Vertex AI OAuth credentials for the ADK agent and a direct API key for the Phoenix evaluator's judge LLM) required careful separation of concerns in the codebase.
Accomplishments that we're proud of
We are most proud of the self-improvement loop. The agent doesn't just evaluate responses and log results. It actively reads its own evaluation history at runtime to detect when quality is degrading and recommends concrete fixes. An agent that monitors its own quality and proposes its own fixes is, we believe, genuinely novel and directly aligned with what Arize rewards in its platform.
We are also proud of the end-to-end traceability. Every single evaluation (verdict, score, latency, span ID) flows through OpenTelemetry into Phoenix Cloud, making every decision the agent makes fully auditable and queryable. This isn't just a demo; it's production-grade observability.
Finally, shipping a working Cloud Run deployment despite significant university org policy restrictions was a real accomplishment that required creative problem-solving under pressure.
What we learned
We learned that LLM observability is genuinely hard to get right. Logging a verdict is easy, but making that verdict queryable, trendable, and actionable across thousands of evaluations requires careful span design from the start.
We learned the importance of async-safe evaluation pipelines. The Phoenix evaluator is synchronous and blocking, and calling it directly inside an async FastAPI handler would have blocked the event loop and stalled every other request while an evaluation was in flight. Wrapping it in a ThreadPoolExecutor was a critical architectural decision.
We also learned that university cloud environments are significantly more restrictive than personal accounts, and that building for production constraints, even artificial ones, makes you a better engineer.
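The executor pattern mentioned above can be sketched with the standard library alone. This is an assumption-laden illustration, not the project's code: `blocking_evaluate` is a hypothetical stand-in that sleeps to simulate a slow synchronous judge call, its substring check is a toy replacement for real faithfulness scoring, and the pool size is arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Shared pool for blocking evaluator calls (size is an assumed value).
_executor = ThreadPoolExecutor(max_workers=4)


def blocking_evaluate(query: str, context: str, response: str) -> dict:
    """Stand-in for a synchronous judge call; sleeps to simulate latency."""
    time.sleep(0.2)
    # Toy grounding check, NOT how a real faithfulness evaluator works.
    grounded = response.lower() in context.lower()
    return {"verdict": "FACTUAL" if grounded else "HALLUCINATION"}


async def evaluate_async(query: str, context: str, response: str) -> dict:
    """Offload the blocking call to a worker thread so the loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        _executor, blocking_evaluate, query, context, response
    )


async def main():
    context = "The capital of France is Paris."
    start = time.perf_counter()
    # Three evaluations run concurrently instead of back to back.
    results = await asyncio.gather(
        evaluate_async("Capital of France?", context, "Paris"),
        evaluate_async("Capital of France?", context, "Lyon"),
        evaluate_async("Capital of France?", context, "paris"),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
verdicts = [r["verdict"] for r in results]
```

Because each call occupies its own worker thread, the three simulated evaluations overlap rather than running one after another, which is the behavior an async request handler needs from a blocking dependency.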
What's next for Hallucination Watchdog
- Webhook alerts: automatically notify a Slack channel or PagerDuty when the hallucination rate crosses a threshold, without anyone having to manually check the dashboard.
- Multi-model comparison: send the same query to multiple models simultaneously and get a side-by-side hallucination scorecard, helping teams pick the most reliable model for their use case.
- Prompt repair suggestions: when a hallucination is detected, automatically generate an improved prompt that constrains the model to the provided context, using Gemini to rewrite the instruction.
- RAG pipeline integration: a drop-in middleware layer for LangChain and LlamaIndex pipelines that automatically intercepts every generation and scores it before returning to the user.
- Fine-tuning dataset export: export all HALLUCINATION-labeled examples as a structured dataset for fine-tuning a smaller, domain-specific model that hallucinates less on your specific use case.
Built With
- arize
- fastapi
- gemini2.5
- googleadk
- opentelemetry

