RAG System
From data ingestion to evaluation
Design Notes
© 2025. Concise class notes for engineers building retrieval-augmented generation.
1. End-to-end pipeline
Overview
Ingest → Chunk → Embed → Index → Retrieve → Rerank → Generate → Post-process → Evaluate
Design for observability: store query, retrieved IDs, scores, and final output together with versioned embeddings (trace-record sketch below).
Keep artifacts versioned: docs, chunks, embedding model, index params, prompts.
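A minimal sketch of the per-query trace record these two points imply, written as JSON lines; the field names and the shape of versions are illustrative assumptions, not a fixed schema.

# Sketch: one JSON-lines trace record per query, with versioned artifacts attached.
import json, time

def log_trace(path, query, retrieved, answer, versions):
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_ids": [c["doc_id"] for c in retrieved],
        "scores": [c["score"] for c in retrieved],
        "answer": answer,
        "versions": versions,   # e.g. {"embedding_model": ..., "index_params": ..., "prompt": ...}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")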
2. Chunking & embeddings
2.1 Chunking
Semantic + structural chunking (headings, code blocks).
Overlap only as needed (e.g., 10–15%); store metadata: source, section, timestamp (chunking sketch below).
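A minimal sketch of structure-aware chunking with overlap, assuming markdown-style headings mark section boundaries; the sizes, regex, and metadata fields are illustrative.

# Sketch: split on headings, then window each section with ~15% overlap.
import hashlib, re, time

def chunk_document(text, source, max_chars=1200, overlap=0.15):
    sections = re.split(r"\n(?=#{1,3} )", text)          # keep headings with their section
    step = int(max_chars * (1 - overlap))
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        heading = section.strip().splitlines()[0]
        for start in range(0, len(section), step):
            body = section[start:start + max_chars]
            if body.strip():
                chunks.append({
                    "text": body,
                    "source": source,
                    "section": heading,
                    "timestamp": time.time(),
                    "doc_id": hashlib.sha1(f"{source}:{heading}:{start}".encode()).hexdigest()[:12],
                })
    return chunks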
2.2 Embeddings
Use domain-fit models where possible; normalize vectors; monitor drift after model upgrades.
Store text, vector, and a hash of the pre-processing pipeline for reproducibility (record sketch below).
Index hygiene beats parameter tweaking: garbage in, garbage out.
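A sketch of the embedding record implied above, with unit-normalization and a pre-processing hash; embed() stands in for whatever model or API you call, and the pipeline description string is an assumption.

# Sketch: normalized embedding record with a reproducibility hash.
import hashlib
import numpy as np

PREPROC_DESC = "lowercase + strip-boilerplate + chunker-v3"   # describe your actual pipeline

def embedding_record(chunk, embed, model_name):
    vec = np.asarray(embed(chunk["text"]), dtype=np.float32)
    vec /= np.linalg.norm(vec) + 1e-12          # unit norm: cosine similarity == dot product
    return {
        "text": chunk["text"],
        "vector": vec.tolist(),
        "embedding_model": model_name,
        "preproc_hash": hashlib.sha256(PREPROC_DESC.encode()).hexdigest()[:16],
    }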
3. Retrieval & reranking
3.1 Retrieval recipes
Hybrid retrieval (BM25 + vector) improves recall; add filters on metadata (time, author).
Multi-query expansion: rewrite the user query into N paraphrases and merge the top-k results (sketch after the hybrid pseudocode below).
# Pseudocode: hybrid search (dense vector + sparse BM25 hits, merged, then reranked).
# dense_index / bm25_index are placeholder names for your vector and keyword stores.
dense = dense_index.search(query, k=20)
sparse = bm25_index.search(query, k=20)
results = rerank(query, dense + sparse)[:k]
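A sketch of the multi-query expansion recipe from the list above; paraphrase() stands in for an LLM rewrite call and hybrid_search() wraps the hybrid recipe just above, so both names are assumptions.

# Sketch: multi-query expansion (paraphrase, search each variant, keep each doc's best score).
def multi_query_search(query, paraphrase, hybrid_search, n_variants=3, k=10):
    variants = [query] + paraphrase(query, n=n_variants)   # include the original query
    best = {}                                               # doc_id -> (score, hit)
    for q in variants:
        for hit in hybrid_search(q, k=k):
            if hit["doc_id"] not in best or hit["score"] > best[hit["doc_id"]][0]:
                best[hit["doc_id"]] = (hit["score"], hit)
    merged = sorted(best.values(), key=lambda t: t[0], reverse=True)
    return [hit for _, hit in merged[:k]]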
3.2 Reranking
Cross-encoder rerankers can boost precision@k; cache per-(query, passage) scores aggressively (sketch below).
Aim for high recall first, then increase precision with rerankers and filters.
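A sketch of cross-encoder reranking with an in-memory score cache, using sentence-transformers' CrossEncoder as one possible backend; the model choice and cache size are assumptions, and rerank() matches the call in the pseudocode above.

# Sketch: cross-encoder reranking with cached per-(query, passage) scores.
from functools import lru_cache
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # illustrative model

@lru_cache(maxsize=100_000)
def pair_score(query, passage):
    return float(model.predict([(query, passage)])[0])

def rerank(query, hits):
    # Highest cross-encoder score first; the caller slices to the final k.
    return sorted(hits, key=lambda h: pair_score(query, h["text"]), reverse=True)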
4. Generation & guardrails
Prompt shape
Instructions + citations requirement + JSON schema for answers.
Insert retrieved chunks with clear separators; limit max tokens (assembly sketch after the template).
SYSTEM: Answer using ONLY supplied context. If missing, say you don't know.
CONTEXT:
<<<chunk 1>>>
<<<chunk 2>>>
OUTPUT: JSON {"answer": string, "citations": [doc_id], "confidence": 0..1}
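One way to assemble that prompt under a token budget; the separators mirror the template, while count_tokens() and the budget are assumptions standing in for your tokenizer and context limits.

# Sketch: build the prompt from retrieved chunks with clear separators and a token budget.
SYSTEM = "Answer using ONLY the supplied context. If the answer is missing, say you don't know."

def build_prompt(question, chunks, count_tokens, max_context_tokens=3000):
    parts, used = [], 0
    for i, ch in enumerate(chunks, start=1):
        block = f"<<<chunk {i} | {ch['doc_id']}>>>\n{ch['text']}\n"
        cost = count_tokens(block)
        if used + cost > max_context_tokens:
            break                                   # stop before exceeding the context budget
        parts.append(block)
        used += cost
    return (SYSTEM + "\nCONTEXT:\n" + "\n".join(parts)
            + f"\nQUESTION: {question}\n"
            + 'OUTPUT: JSON {"answer": string, "citations": [doc_id], "confidence": 0..1}')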
Guardrails
Grounding check: ask the model to quote exact spans before answering.
Toxicity/redaction passes on output; domain allow-lists for sources.
Schema-bound outputs reduce hallucinations and simplify UI rendering (validation sketch below).
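A sketch of the post-generation checks above: parse the JSON, require the template's fields, confirm citations stay inside the retrieved set, and verify any quoted spans actually appear in the context. The quoted_spans field is an assumption tied to the grounding-check bullet, not part of the template's schema.

# Sketch: validate the model's JSON output and check grounding against retrieved chunks.
import json

def validate_answer(raw_output, retrieved):
    answer = json.loads(raw_output)                        # raises on malformed JSON
    assert {"answer", "citations", "confidence"} <= set(answer), "missing required fields"
    allowed_ids = {c["doc_id"] for c in retrieved}
    assert set(answer["citations"]) <= allowed_ids, "citation outside the retrieved set"
    context = " ".join(c["text"] for c in retrieved)
    for span in answer.get("quoted_spans", []):            # grounding quotes, if requested
        assert span in context, f"ungrounded quote: {span[:60]}"
    return answer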
5. Evaluation & observability
Offline eval
IR metrics: recall@k, nDCG; QA metrics: answerable/unanswerable handling, exact match, citation accuracy.
Use a labeled set of question/gold-span pairs; refresh it monthly (metric sketch below).
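A sketch of the IR metrics named above, computed per query with binary relevance labels and averaged over the labeled set.

# Sketch: recall@k and nDCG@k for one query, given binary relevance labels.
import math

def recall_at_k(retrieved_ids, relevant_ids, k):
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / max(len(relevant_ids), 1)

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved_ids[:k]) if doc in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0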
Online eval
Collect user feedback; detect answer changes vs. the baseline; monitor latency and cost per query.
Shadow-deploy new indexes/models; A/B test prompt variants (shadow-comparison sketch below).
Ship dashboards: retrieval quality, latency, costs, and safety incidents.
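A sketch of shadow-mode comparison for the online checks above: serve the baseline answer, run the candidate pipeline on the same query, and log latency plus answer divergence for review. The pipeline callables, similarity measure, and 0.9 threshold are all assumptions.

# Sketch: shadow-deploy comparison (log candidate vs. baseline answers and latencies).
import difflib, json, time

def shadow_compare(query, baseline_pipeline, candidate_pipeline, log_path="shadow_log.jsonl"):
    t0 = time.time(); base = baseline_pipeline(query); base_ms = (time.time() - t0) * 1000
    t1 = time.time(); cand = candidate_pipeline(query); cand_ms = (time.time() - t1) * 1000
    similarity = difflib.SequenceMatcher(None, base["answer"], cand["answer"]).ratio()
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "query": query,
            "baseline_answer": base["answer"],
            "candidate_answer": cand["answer"],
            "answer_similarity": round(similarity, 3),
            "changed": similarity < 0.9,                  # flag for human review
            "baseline_latency_ms": round(base_ms),
            "candidate_latency_ms": round(cand_ms),
        }) + "\n")
    return base                                           # users still see the baseline answer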