🧪 LLM Eval Framework

A comprehensive evaluation framework for LLMs and RAG pipelines: hallucination detection, LLM-as-a-judge scoring, semantic similarity, and a live dashboard.

Python OpenAI HuggingFace Dashboard License: MIT

⚡ Quick Start · 🧠 Evaluation Methods · 📊 Dashboard · 🔌 What It Evaluates


LLM Eval Framework Dashboard

Dashboard showing a 10-case evaluation run. Overall: 0.874 · Accuracy: 0.896 · Hallucination: 1.000 · Relevance: 0.950 · Coherence: 0.650


🌟 Overview

A modular, extensible evaluation framework for measuring the quality of LLM responses and RAG pipelines. It goes beyond simple string matching, combining LLM-as-a-judge scoring, semantic similarity, faithfulness checks, BLEU/ROUGE metrics, and cost/latency tracking, with all results surfaced in a live interactive dashboard.

Built for AI engineers who need rigorous, reproducible evaluation, not just vibes.


🔌 What It Evaluates

Target | What Gets Measured
RAG Pipelines | Retrieval quality, answer groundedness, context coverage
LLM Response Quality | Accuracy, relevance, coherence, completeness
Hallucination Detection | Faithfulness to source; flags unsupported claims
Prompt A/B Testing | Side-by-side comparison of prompt variants (sketched below)
Custom Tasks | Pluggable scoring for domain-specific evaluation needs
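
As a taste of the A/B workflow, two prompt variants can be scored against the same context and reference via the evaluate() call shown in the Quick Start. A minimal sketch; the variant prompts, the shared inputs, and the generate_response() helper are illustrative, not part of the shipped API:

from eval import EvalPipeline

pipeline = EvalPipeline()

# Hypothetical prompt variants for the same test case
variants = {
    "terse": "List Apple's main risk factors.",
    "guided": "Using only the provided context, summarise Apple's main risk factors.",
}

for name, prompt in variants.items():
    result = pipeline.evaluate(
        prompt=prompt,
        response=generate_response(prompt),  # your model under test (illustrative helper)
        context=shared_context,              # same retrieved context for both variants
        reference=shared_reference,          # same reference answer for both variants
    )
    print(f"{name}: overall={result.scores['overall']:.3f}")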

📊 Dashboard

The interactive dashboard surfaces all evaluation results in real time:

  • Summary cards: Overall, Accuracy, Hallucination, Relevance, Coherence scores at a glance
  • Score Distribution: per-question bar chart across all metrics
  • Radar Chart: averaged score profile across all dimensions
  • Individual Results: drill-down into each test case
  • Filters: filter by category and minimum overall score
  • Evaluation run selector: compare across multiple saved runs

🧠 Evaluation Methods

1. LLM-as-a-Judge

Uses GPT or Claude as an automated judge to score responses on accuracy, relevance, and coherence, returning structured scores with reasoning. Both the judge model and the scoring rubric are configurable.
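
For orientation, here is a minimal sketch of the pattern. It is not the repository's judge.py; the rubric text, model name, and JSON schema are assumptions:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the response against the reference on accuracy, relevance, and "
    "coherence, each from 0.0 to 1.0. Reply with JSON: "
    '{"accuracy": ..., "relevance": ..., "coherence": ..., "reasoning": "..."}'
)

def judge(prompt: str, response: str, reference: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # judge model is configurable; this one is an example
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question: {prompt}\n"
                                        f"Response: {response}\n"
                                        f"Reference: {reference}"},
        ],
    )
    return json.loads(completion.choices[0].message.content)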

2. Semantic Similarity

Computes embedding-based cosine similarity between generated responses and reference answers. Catches paraphrased correct answers that BLEU/ROUGE would miss.
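
The idea in miniature, using the sentence-transformers library listed in the tech stack below (the model name here is an example; the framework's actual choice may differ):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

generated = "Apple faces supply chain disruption risks."
reference = "Key risks include concentration of the supply chain."

# Encode both texts and take the cosine similarity of their embeddings
embeddings = model.encode([generated, reference], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {score:.3f}")  # high despite little word overlap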

3. Faithfulness / Groundedness Checks

Verifies that every claim in the generated response is supported by the retrieved context. Flags hallucinated statements not grounded in source documents.
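
One common way to implement this is to decompose the response into atomic claims and ask a verifier model whether each claim is entailed by the context. A hedged sketch, not the repository's faithfulness.py; the claim splitting and verification prompt are assumptions:

from openai import OpenAI

client = OpenAI()

def is_supported(claim: str, context: str) -> bool:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example verifier model
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nClaim: {claim}\n"
                       "Is the claim fully supported by the context? Answer yes or no.",
        }],
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")

def faithfulness(claims: list[str], context: str) -> float:
    # Fraction of claims grounded in the retrieved context (1.0 = no hallucinations)
    supported = sum(is_supported(c, context) for c in claims)
    return supported / len(claims) if claims else 1.0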

4. BLEU / ROUGE Scores

Classic n-gram overlap metrics for benchmarking against reference answers, useful for regression testing and comparison against baselines.
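
Both metrics are available off the shelf via nltk and rouge-score, the libraries this repo lists in its tech stack; a minimal usage sketch:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "Key risks include supply chain, competition, and regulation."
candidate = "Apple faces risks including supply chain disruption."

# BLEU: n-gram precision over tokenised text (smoothing avoids zeros on short texts)
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: unigram and longest-common-subsequence overlap against the reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")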

5. Custom Scoring Functions

Pluggable scoring interface: define any task-specific metric and plug it into the evaluation pipeline without modifying core logic.
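
The real interface lives in eval/custom.py; as a sketch of the shape such a plugin typically takes (all names here are illustrative, including the registration call):

from typing import Protocol

class Scorer(Protocol):
    """Illustrative scorer interface: a named object the pipeline can invoke per case."""
    name: str
    def score(self, prompt: str, response: str, context: str, reference: str) -> float: ...

class ContainsCitationScorer:
    """Example domain-specific metric: does the answer cite its source document?"""
    name = "cites_source"

    def score(self, prompt, response, context, reference) -> float:
        return 1.0 if "10-K" in response else 0.0

# Hypothetical registration hook; see eval/custom.py for the real one
# pipeline.register_scorer(ContainsCitationScorer())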

6. Latency & Cost Tracking

Records end-to-end response time and estimated token cost per evaluation run, essential for comparing models and prompt variants at scale.
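
In essence (a sketch: the per-token prices below are illustrative placeholders, and real prices change, so look them up):

import time
from openai import OpenAI

client = OpenAI()

# Illustrative placeholder prices in USD per 1M tokens; check current pricing
PRICE_IN, PRICE_OUT = 0.15, 0.60

start = time.perf_counter()
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What are Apple's main risk factors?"}],
)
latency = time.perf_counter() - start

usage = completion.usage  # token counts reported by the API
cost = (usage.prompt_tokens * PRICE_IN + usage.completion_tokens * PRICE_OUT) / 1e6
print(f"latency: {latency:.2f}s  tokens: {usage.total_tokens}  est. cost: ${cost:.6f}")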


πŸ—οΈ Architecture

Input (prompt + context + reference answer)
        │
        ▼
Evaluation Pipeline
        ├── LLM-as-a-Judge        → accuracy, relevance, coherence scores
        ├── Semantic Similarity   → embedding cosine similarity
        ├── Faithfulness Check    → hallucination detection
        ├── BLEU / ROUGE          → n-gram overlap scores
        ├── Custom Scorers        → pluggable task-specific metrics
        └── Latency / Cost        → token usage + response time
        │
        ▼
Results Aggregator  → saves to results/eval_<timestamp>.json
        │
        ▼
Dashboard UI        → interactive live visualisation
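
Because saved runs are plain JSON files, they can be inspected or post-processed outside the dashboard. A small sketch, assuming only the results/eval_<timestamp>.json naming shown above:

import json
from pathlib import Path

# Filenames embed a timestamp, so lexicographic max is the most recent run
latest = max(Path("results").glob("eval_*.json"))
run = json.loads(latest.read_text())
print(f"Loaded {latest.name} ({type(run).__name__} at the top level)")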

⚡ Quick Start

Prerequisites

  • Python 3.10+
  • OpenAI API key (for LLM-as-a-judge)

1. Clone and install

git clone https://github.com/overcastbulb/llm-eval-framework.git
cd llm-eval-framework
python -m venv .venv
source .venv/bin/activate    # macOS/Linux
# .\.venv\Scripts\activate   # Windows
pip install -r requirements.txt

2. Configure environment

cp .env.example .env
# Add your OPENAI_API_KEY to .env

3. Run an evaluation

from eval import EvalPipeline

pipeline = EvalPipeline()

result = pipeline.evaluate(
    prompt="What are Apple's main risk factors?",
    response="Apple faces risks including supply chain disruption...",
    context="Apple's 10-K states that supply chain concentration...",
    reference="Key risks include supply chain, competition, and regulation..."
)

print(result.scores)
# {
#   "overall": 0.874,
#   "accuracy": 0.896,
#   "hallucination": 1.000,
#   "relevance": 0.950,
#   "coherence": 0.650
# }

4. Launch the dashboard

python dashboard.py

Open http://localhost:8050 and select any saved evaluation run from the sidebar to explore results.


πŸ“ Project Structure

llm-eval-framework/
├── eval/
│   ├── pipeline.py            # Main evaluation orchestrator
│   ├── judge.py               # LLM-as-a-judge scorer
│   ├── similarity.py          # Semantic similarity (embeddings)
│   ├── faithfulness.py        # Hallucination / groundedness checks
│   ├── ngram.py               # BLEU / ROUGE metrics
│   ├── custom.py              # Pluggable custom scorer interface
│   └── tracking.py            # Latency and cost tracking
├── results/                   # Saved evaluation run JSONs
├── dashboard.py               # Dashboard UI entry point
├── .env.example
├── requirements.txt
└── README.md

πŸ› οΈ Tech Stack

Component | Technology
Evaluation Engine | Python
LLM Judge | OpenAI GPT / Anthropic Claude
Embeddings | HuggingFace sentence-transformers
BLEU / ROUGE | nltk / rouge-score
Dashboard | Plotly Dash / Streamlit
Cost Tracking | OpenAI token usage API

πŸ—ΊοΈ Roadmap

  • Support for additional judge models (Gemini, local Ollama)
  • Batch evaluation across large datasets
  • CI/CD integration: run evals as a GitHub Actions step
  • Evaluation dataset builder and versioning
  • PDF report export

📄 License

This project is licensed under the MIT License.


Built for AI engineers who need rigorous, reproducible LLM evaluation.

Report a Bug · Request a Feature
