A comprehensive evaluation framework for LLMs and RAG pipelines: hallucination detection, LLM-as-a-judge scoring, semantic similarity, and a live dashboard.
Quick Start · Evaluation Methods · Dashboard · What It Evaluates
Dashboard showing a 10-case evaluation run – Overall: 0.874 · Accuracy: 0.896 · Hallucination: 1.000 · Relevance: 0.950 · Coherence: 0.650
A modular, extensible evaluation framework for measuring the quality of LLM responses and RAG pipelines. Goes beyond simple string matching, combining LLM-as-a-judge scoring, semantic similarity, faithfulness checks, BLEU/ROUGE metrics, and cost/latency tracking, with all results surfaced in a live interactive dashboard.
Built for AI engineers who need rigorous, reproducible evaluation, not just vibes.
| Target | What Gets Measured |
|---|---|
| RAG Pipelines | Retrieval quality, answer groundedness, context coverage |
| LLM Response Quality | Accuracy, relevance, coherence, completeness |
| Hallucination Detection | Faithfulness to source; flags unsupported claims |
| Prompt A/B Testing | Side-by-side comparison of prompt variants |
| Custom Tasks | Pluggable scoring for domain-specific evaluation needs |
The interactive dashboard surfaces all evaluation results in real time:
- Summary cards – Overall, Accuracy, Hallucination, Relevance, Coherence scores at a glance
- Score Distribution – per-question bar chart across all metrics
- Radar Chart – averaged score profile across all dimensions
- Individual Results – drilldown into each test case
- Filters – filter by category and minimum overall score
- Evaluation run selector – compare across multiple saved runs
**LLM-as-a-Judge** – Uses GPT or Claude as an automated judge to score responses on accuracy, relevance, and coherence, returning structured scores with reasoning. Configurable judge model and scoring rubric.
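For orientation only (the real scorer lives in `eval/judge.py`), a judge call might look roughly like the sketch below. The rubric text, the `judge` helper, and the `gpt-4o-mini` model name are assumptions; the only firm dependencies are the `openai>=1.x` SDK and an `OPENAI_API_KEY` in the environment.

```python
import json
from openai import OpenAI  # openai>=1.x client; reads OPENAI_API_KEY from the environment

client = OpenAI()

RUBRIC = (
    "You are an impartial evaluator. Score the RESPONSE against the PROMPT and "
    "REFERENCE on accuracy, relevance, and coherence, each between 0.0 and 1.0. "
    'Reply only with JSON: {"accuracy": ..., "relevance": ..., "coherence": ..., "reasoning": "..."}'
)

def judge(prompt: str, response: str, reference: str, model: str = "gpt-4o-mini") -> dict:
    """Return structured judge scores plus a short rationale as a dict."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging keeps scores reproducible across runs
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}\n\nREFERENCE:\n{reference}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)
```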
**Semantic Similarity** – Computes embedding-based cosine similarity between generated responses and reference answers. Catches paraphrased correct answers that BLEU/ROUGE would miss.
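A minimal sketch of the embedding comparison using the `sentence-transformers` stack from the tech table; the `all-MiniLM-L6-v2` model and the `semantic_similarity` helper are illustrative choices, not necessarily what `eval/similarity.py` uses.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed default; any sentence embedding model works

def semantic_similarity(response: str, reference: str) -> float:
    """Cosine similarity between response and reference embeddings (close to 1.0 for paraphrases)."""
    response_emb, reference_emb = model.encode([response, reference], convert_to_tensor=True)
    return float(util.cos_sim(response_emb, reference_emb))
```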
**Faithfulness Check** – Verifies that every claim in the generated response is supported by the retrieved context. Flags hallucinated statements not grounded in source documents.
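Conceptually the check reduces to "what fraction of claims are grounded in the context". A simplified sketch, where naive sentence splitting stands in for real claim extraction and `is_supported` is any supported/unsupported classifier (typically an LLM or NLI call):

```python
def faithfulness(response: str, context: str, is_supported) -> float:
    """Fraction of claims in the response that are grounded in the retrieved context.

    `is_supported(claim, context) -> bool` can be an LLM judge or an NLI model;
    claims are approximated here by splitting the response on full stops.
    """
    claims = [s.strip() for s in response.split(".") if s.strip()]
    if not claims:
        return 0.0
    grounded = sum(1 for claim in claims if is_supported(claim, context))
    return grounded / len(claims)
```

A faithfulness score of 1.000, as in the example run above, means every extracted claim was judged grounded.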
**BLEU / ROUGE** – Classic n-gram overlap metrics for benchmarking against reference answers; useful for regression testing and comparison against baselines.
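Using the `nltk` and `rouge-score` packages listed in the tech stack, the overlap metrics can be computed roughly as below; the smoothing method and the choice of ROUGE-L are assumptions rather than the framework's exact configuration.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def ngram_scores(response: str, reference: str) -> dict:
    """Sentence-level BLEU (smoothed for short texts) and ROUGE-L F1 against one reference."""
    bleu = sentence_bleu(
        [reference.split()],
        response.split(),
        smoothing_function=SmoothingFunction().method1,
    )
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True).score(reference, response)
    return {"bleu": bleu, "rougeL": rouge["rougeL"].fmeasure}
```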
**Custom Scorers** – Pluggable scoring interface: define any task-specific metric and plug it into the evaluation pipeline without modifying core logic.
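As an illustration, a custom scorer can be as small as a class with a `name` and a `score()` method; the exact interface is defined in `eval/custom.py`, so treat this shape as an assumption.

```python
class KeywordCoverage:
    """Example domain-specific metric: fraction of required keywords present in the response."""

    name = "keyword_coverage"

    def __init__(self, keywords: list[str]):
        self.keywords = keywords

    def score(self, prompt: str, response: str, context: str, reference: str) -> float:
        if not self.keywords:
            return 0.0
        hits = sum(1 for kw in self.keywords if kw.lower() in response.lower())
        return hits / len(self.keywords)
```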
**Latency / Cost Tracking** – Records end-to-end response time and estimated token cost per evaluation run; essential for comparing models and prompt variants at scale.
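A minimal sketch of the bookkeeping involved; the per-1K-token prices are placeholders (real prices depend on the model and change over time), and token counts are assumed to come from the provider's `usage` field.

```python
import time

PRICE_PER_1K_TOKENS = {"prompt": 0.0005, "completion": 0.0015}  # placeholder USD prices

def timed_call(fn, *args, **kwargs):
    """Run any LLM call and return (result, latency in seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Rough dollar cost from the token counts reported in the API response's usage field."""
    return (
        prompt_tokens / 1000 * PRICE_PER_1K_TOKENS["prompt"]
        + completion_tokens / 1000 * PRICE_PER_1K_TOKENS["completion"]
    )
```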
```
Input (prompt + context + reference answer)
            │
            ▼
    Evaluation Pipeline
    ├── LLM-as-a-Judge      → accuracy, relevance, coherence scores
    ├── Semantic Similarity → embedding cosine similarity
    ├── Faithfulness Check  → hallucination detection
    ├── BLEU / ROUGE        → n-gram overlap scores
    ├── Custom Scorers      → pluggable task-specific metrics
    └── Latency / Cost      → token usage + response time
            │
            ▼
    Results Aggregator      → saves to results/eval_<timestamp>.json
            │
            ▼
    Dashboard UI            → interactive live visualisation
```
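Saved runs can also be inspected outside the dashboard. The sketch below loads the most recent `results/eval_<timestamp>.json`; since the inner schema is whatever the aggregator writes, it only prints the top-level keys rather than assuming them.

```python
import glob
import json

# Pick the most recently written results/eval_<timestamp>.json
# (assumes at least one evaluation run has been saved).
latest = sorted(glob.glob("results/eval_*.json"))[-1]

with open(latest) as f:
    run = json.load(f)

print(latest, "->", list(run))  # top-level keys of the saved run
```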
- Python 3.10+
- OpenAI API key (for LLM-as-a-judge)
```bash
git clone https://github.com/overcastbulb/llm-eval-framework.git
cd llm-eval-framework

python -m venv .venv
source .venv/bin/activate      # macOS/Linux
# .\.venv\Scripts\activate     # Windows

pip install -r requirements.txt

cp .env.example .env
# Add your OPENAI_API_KEY to .env
```

```python
from eval import EvalPipeline

pipeline = EvalPipeline()
result = pipeline.evaluate(
    prompt="What are Apple's main risk factors?",
    response="Apple faces risks including supply chain disruption...",
    context="Apple's 10-K states that supply chain concentration...",
    reference="Key risks include supply chain, competition, and regulation...",
)

print(result.scores)
# {
# "overall": 0.874,
# "accuracy": 0.896,
# "hallucination": 1.000,
# "relevance": 0.950,
# "coherence": 0.650
# }
```

```bash
python dashboard.py
```

Open http://localhost:8050 and select any saved evaluation run from the sidebar to explore results.
```
llm-eval-framework/
├── eval/
│   ├── pipeline.py       # Main evaluation orchestrator
│   ├── judge.py          # LLM-as-a-judge scorer
│   ├── similarity.py     # Semantic similarity (embeddings)
│   ├── faithfulness.py   # Hallucination / groundedness checks
│   ├── ngram.py          # BLEU / ROUGE metrics
│   ├── custom.py         # Pluggable custom scorer interface
│   └── tracking.py       # Latency and cost tracking
├── results/              # Saved evaluation run JSONs
├── dashboard.py          # Dashboard UI entry point
├── .env.example
├── requirements.txt
└── README.md
```
| Component | Technology |
|---|---|
| Evaluation Engine | Python |
| LLM Judge | OpenAI GPT / Anthropic Claude |
| Embeddings | HuggingFace sentence-transformers |
| BLEU / ROUGE | nltk / rouge-score |
| Dashboard | Plotly Dash / Streamlit |
| Cost Tracking | OpenAI token usage API |
- Support for additional judge models (Gemini, local Ollama)
- Batch evaluation across large datasets
- CI/CD integration: run evals as a GitHub Actions step
- Evaluation dataset builder and versioning
- PDF report export
This project is licensed under the MIT License.
Report a Bug · Request a Feature
