RAG System Design Notes
From data ingestion to evaluation

© 2025 Concise class notes for engineers building retrieval-augmented generation.
1. End-to-end pipeline
Overview
Ingest → Chunk → Embed → Index → Retrieve → Rerank → Generate → Post-process → Evaluate
Design for observability: store the query, retrieved IDs, scores, and final output with versioned embeddings.

Keep artifacts versioned: docs, chunks, embedding model, index params, prompts.
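A minimal sketch of what such a versioned trace record could look like; the field names below are illustrative, not part of the notes.

# Sketch: one trace record per query, carrying the versioned artifacts alongside the result.
from dataclasses import dataclass
from typing import List

@dataclass
class QueryTrace:
    query: str
    retrieved_ids: List[str]     # chunk IDs returned by the retriever
    scores: List[float]          # similarity / rerank scores, same order as IDs
    answer: str                  # final generated output
    embedding_model: str         # versioned embedding model name
    index_version: str           # index params / build hash
    prompt_version: str          # prompt template revision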
2. Chunking & embeddings
2.1 Chunking
Semantic + structural chunking (headings, code blocks).
Overlap only as needed (e.g., 10–15%); store metadata: source, section, timestamp.
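One way to sketch structural chunking with modest overlap; the heading detection and size limits below are assumptions, not a prescribed implementation.

# Sketch: split on heading lines, then window long sections with roughly 10-15% overlap.
def chunk(text, max_chars=1200, overlap=150):
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:     # structural boundary (markdown-style heading)
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    chunks = []
    for sec in sections:
        start = 0
        while start < len(sec):
            chunks.append(sec[start:start + max_chars])
            start += max_chars - overlap         # 150 / 1200 ≈ 12.5% overlap between windows
    return chunks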
2.2 Embeddings
Use domain-fit models where possible; normalize vectors; monitor drift after model upgrades.
Store the text, the vector, and a hash of the pre-processing pipeline for reproducibility.
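A small sketch of storing normalized vectors with a pipeline fingerprint; embed_fn and the record layout are placeholders.

# Sketch: unit-normalize the vector and fingerprint the pre-processing code for reproducibility.
import hashlib
import numpy as np

def embed_record(text, embed_fn, pipeline_source):
    vec = np.asarray(embed_fn(text), dtype=np.float32)
    vec = vec / (np.linalg.norm(vec) + 1e-12)      # unit norm for cosine search
    pipeline_hash = hashlib.sha256(pipeline_source.encode()).hexdigest()
    return {"text": text, "vector": vec, "pipeline_hash": pipeline_hash}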

Index hygiene beats parameter tweaking. Garbage in, garbage out.


3. Retrieval & reranking
3.1 Retrieval recipes
Hybrid retrieval (BM25 + vector) improves recall; add filters on metadata (time, author).
Multi-query expansion: rewrite the user query into N paraphrases and merge the top-k results (see the merge sketch after the pseudocode below).
# Pseudocode: hybrid search (vector_index, bm25_index, and rerank are placeholders)
dense = vector_index.search(query, k=20)     # embedding-similarity candidates
sparse = bm25_index.search(query, k=20)      # lexical BM25 candidates
results = rerank(dense + sparse)[:k]         # merge candidates, rerank, keep final top k
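For the multi-query expansion above, a common way to merge the per-paraphrase result lists is reciprocal rank fusion; the sketch below assumes each list is simply ranked doc IDs.

# Sketch: merge ranked lists from N query paraphrases with reciprocal rank fusion (RRF).
def rrf_merge(ranked_lists, k=60, top_k=10):
    scores = {}
    for results in ranked_lists:                 # one ranked list of doc IDs per paraphrase
        for rank, doc_id in enumerate(results):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]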

3.2 Reranking
Cross-encoder rerankers can boost precision@k; cache aggressively.
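A possible shape for a cached cross-encoder reranker, assuming the sentence-transformers CrossEncoder API; the checkpoint name is just an example.

# Sketch: score (query, passage) pairs with a cross-encoder and cache scores per pair.
from functools import lru_cache
from sentence_transformers import CrossEncoder

_model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # example checkpoint

@lru_cache(maxsize=100_000)
def _score(query, passage):
    return float(_model.predict([(query, passage)])[0])

def rerank(query, passages, top_k=5):
    ranked = sorted(passages, key=lambda p: _score(query, p), reverse=True)
    return ranked[:top_k]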

Aim for high recall first, then increase precision with rerankers and filters.
4. Generation & guardrails
Prompt shape
Instructions + citation requirement + JSON schema for answers.
Insert retrieved chunks with clear separators; limit max tokens.
SYSTEM: Answer using ONLY supplied context. If missing, say you don't know.
CONTEXT:
<<<chunk 1>>>
<<<chunk 2>>>
OUTPUT: JSON {"answer": string, "citations": [doc_id], "confidence": 0..1}
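A rough sketch of assembling that prompt from retrieved chunks; the chunk fields, the question slot, and the character budget are assumptions standing in for a real token limit.

# Sketch: insert chunks with clear separators and stop before exceeding the context budget.
SYSTEM = "Answer using ONLY supplied context. If missing, say you don't know."

def build_prompt(question, chunks, max_context_chars=8000):
    context, used = [], 0
    for c in chunks:
        block = f"<<<{c['doc_id']}>>>\n{c['text']}"
        if used + len(block) > max_context_chars:    # crude stand-in for a token limit
            break
        context.append(block)
        used += len(block)
    return (f"SYSTEM: {SYSTEM}\nCONTEXT:\n" + "\n".join(context) +
            f"\nQUESTION: {question}\nOUTPUT: JSON")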

Guardrails
Grounding check: ask the model to quote exact spans before answering.
Toxicity/redaction passes on output; domain allow-lists for sources.

Schema-bound outputs reduce hallucinations and simplify UI rendering.
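One way to enforce the JSON schema on the model's output, assuming pydantic (v2) is available; the field names mirror the template above.

# Sketch: validate the generated JSON against the answer schema before rendering it.
from typing import List
from pydantic import BaseModel, Field

class Answer(BaseModel):
    answer: str
    citations: List[str]                       # doc_ids cited by the model
    confidence: float = Field(ge=0.0, le=1.0)

def parse_answer(raw_json: str) -> Answer:
    return Answer.model_validate_json(raw_json)   # raises ValidationError on malformed output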


5. Evaluation & observability
Offline eval
IR metrics: recall@k, nDCG. QA metrics: answerable/unanswerable handling, exact match, citation accuracy.
Use a labeled set of questions with gold spans; refresh it monthly.
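A minimal sketch of the offline IR metrics for a single query, assuming binary relevance labels and ranked doc IDs.

# Sketch: recall@k and nDCG@k against a labeled gold set for one query.
import math

def recall_at_k(ranked_ids, gold_ids, k=10):
    hits = len(set(ranked_ids[:k]) & set(gold_ids))
    return hits / max(len(gold_ids), 1)

def ndcg_at_k(ranked_ids, gold_ids, k=10):
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]) if d in gold_ids)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(gold_ids), k)))
    return dcg / ideal if ideal else 0.0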
Online eval
Collect user feedback; detect answer changes vs. baseline; monitor latency and cost per query.
Shadow deploy new indexes/models; A/B test prompt variants.

Ship dashboards: retrieval quality, latency, costs, and safety incidents.
