SlimContext

Post-retrieval context optimizer for RAG — deterministic, fast, no LLM.

SlimContext is a standalone Python service for the retrieve → optimize → generate step in RAG. It takes over-fetched chunks from your vector DB, BM25, or hybrid retriever and returns a smaller set that maximizes information per token: deduplicated, topically grouped, diversity-ranked, and trimmed to a budget.

No LLM calls. Same input always yields the same output. Typical latency is single-digit to low hundreds of milliseconds depending on whether embeddings are supplied by the caller.

The problem

Production RAG usually over-fetches on purpose: retrieve 20–50 chunks, then hope the model figures it out. In practice:

30–40% of retrieved context is semantically redundant (same idea, different wording).
High-scoring chunks cluster on one theme (e.g. five “Redis is a cache” passages) while other useful topics never reach the LLM.
Token cost and latency scale with raw chunk count, not with useful facts.

Fetching fewer results from the vector DB hurts recall. The better pattern is:

Over-fetch for recall → optimize for precision and diversity → send to the LLM.

SlimContext is that optimization step, exposed as a small HTTP API.

Where it fits

Vector DB / BM25 / hybrid search
        ↓  (over-fetch: many chunks + scores + embeddings)
   SlimContext                    ← you are here
        ↓  (dedupe · cluster · MMR · budget)
        LLM

Architecture

End-to-end pipeline

┌─────────────┐   ┌──────────────────┐   ┌─────────────────────┐   ┌─────────────────┐
│ Exact dedup │ → │ Semantic dedup   │ → │ Agglomerative       │ → │ Representative  │
│ (hash)      │   │ (embedding +     │   │ cluster (cosine     │   │ (1 per cluster) │
│             │   │  lexical)        │   │  distance)          │   │                 │
└─────────────┘   └──────────────────┘   └─────────────────────┘   └────────┬────────┘
                                                                            ↓
┌─────────────┐   ┌──────────────────┐   ┌──────────────────────────────────┐
│ Compression │ ← │ Token budget     │ ← │ MMR (optional) or top-k by score │
│ (optional)  │   │ (pack in order)  │   │                                  │
└─────────────┘   └──────────────────┘   └──────────────────────────────────┘

Canonical RAG optimization path

Query → Over-fetch (N) → Cluster → Select → [MMR] (k) → LLM

SlimContext adds an explicit semantic dedup stage before clustering so paraphrases do not consume target_k slots.
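As a rough illustration of that semantic dedup stage (a sketch under stated assumptions, not SlimContext's actual code): walk chunks in descending `score` order and drop a chunk only when it is both within a tight cosine distance of an already-kept chunk and shares most of its tokens with it. The Jaccard token overlap stands in for the TF-IDF / lexical check, and the function names and `lexical_min=0.5` are hypothetical.

```python
import math
import re


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def jaccard(text_a, text_b):
    """Token-set overlap, a simple stand-in for a TF-IDF similarity (assumption)."""
    ta = set(re.findall(r"\w+", text_a.lower()))
    tb = set(re.findall(r"\w+", text_b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def semantic_dedup(chunks, distance_max=0.001, lexical_min=0.5):
    """Keep the highest-scored chunk of each paraphrase group."""
    kept = []
    # sorted() is stable, so ties keep input order and the result is deterministic.
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        is_dup = any(
            cosine_distance(chunk["embedding"], k["embedding"]) <= distance_max
            and jaccard(chunk["text"], k["text"]) >= lexical_min
            for k in kept
        )
        if not is_dup:
            kept.append(chunk)
    return kept


chunks = [
    {"id": "a", "text": "Redis caches hot data in RAM", "embedding": [1.0, 0.0], "score": 0.9},
    {"id": "b", "text": "Redis caches hot data in memory", "embedding": [1.0, 0.0001], "score": 0.8},
    {"id": "c", "text": "Kafka is a distributed log", "embedding": [0.0, 1.0], "score": 0.7},
]
print([c["id"] for c in semantic_dedup(chunks)])
```

Requiring both signals is what separates "same claim" from "same topic": two chunks with near-identical embeddings but little shared wording both survive this stage and are left for clustering to group.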

What each layer does (and why)

Layer	What it does	Why it exists
1. Exact dedup	SHA-256 hash of normalized text, scoped by `namespace`.	Cheap, perfect removal of copy-paste duplicates from multi-source retrieval.
2. Embed (optional)	If any chunk lacks an embedding, encode all texts with `BAAI/bge-small-en-v1.5` (single vector space).	Clustering and MMR need vectors; callers with precomputed embeddings skip this entirely.
3. Semantic dedup	Pairwise paraphrase removal on chunks ranked by retrieval `score`: drop a lower-scored chunk when embeddings are nearly identical and TF-IDF / lexical overlap suggests the same claim (not merely the same topic).	Example: five passages all explain “Redis caches hot data in RAM” — keep the highest-scored one. Semantic dedup answers “have we already said this?”
4. Topical clustering	Agglomerative clustering on cosine distance between embeddings (default linkage `average`, threshold `dedup_threshold`).	Groups semantically similar chunks so one representative can stand in for the cluster.
5. Representative selection	Pick the best chunk per cluster (`auto` = highest retrieval `score`, or centroid / query-closest / longest).	Reduces each topic to its strongest evidence before final selection.
6. MMR (optional)	When `enable_mmr=true` and candidates exceed `target_k`: Maximal Marginal Relevance `λ × relevance − (1−λ) × diversity_penalty`. When disabled: top `target_k` by retrieval `score`.	Balances relevance and diversity under `target_k`.
7. Compression	Light filler removal; structured text truncated with a placeholder.	Cuts noise without an LLM summarizer.
8. Token budget	Pack whole chunks in MMR order until `token_budget` is full; skip chunks that do not fit.	Hard cap for model context windows and cost control.
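The first layer is simple enough to sketch in a few lines. This is an illustration only: the normalization rule (lowercase plus whitespace collapsing), the function names, and the shared `seen` set are assumptions, since the README does not specify how SlimContext normalizes text or stores hashes.

```python
import hashlib


def normalize(text):
    """Assumed normalization: lowercase and collapse whitespace runs."""
    return " ".join(text.lower().split())


def exact_dedup(chunks, namespace="default", seen=None):
    """Drop chunks whose normalized text was already seen in this namespace."""
    if seen is None:
        seen = set()
    unique = []
    for chunk in chunks:
        # Prefixing the namespace keeps hashes from different tenants apart
        # when the same `seen` set is shared across calls.
        raw = f"{namespace}\x00{normalize(chunk['text'])}".encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique


chunks = [
    {"id": "c1", "text": "Redis stores hot data in memory."},
    {"id": "c2", "text": "redis  stores hot data\nin memory."},
    {"id": "c3", "text": "Kafka is a distributed log."},
]
print([c["id"] for c in exact_dedup(chunks, namespace="docs")])
```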

Design principle: optimize for useful coverage under a token budget, not minimum redundancy alone. A dedup engine returns one chunk; a context optimizer returns one chunk per distinct intent (up to `target_k`).
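The token budget layer can be pictured as a greedy pass over the already-ordered chunks. The sketch below is a minimal illustration, assuming a whitespace word count as the tokenizer (the real token counter is not described here) and using hypothetical function names.

```python
def count_tokens(text):
    """Crude stand-in tokenizer (assumption): whitespace-separated words."""
    return len(text.split())


def pack_to_budget(ordered_chunks, token_budget=None):
    """Keep whole chunks in the given order; skip any that would overflow."""
    if token_budget is None:
        return list(ordered_chunks), 0  # null budget means no cap
    packed, used, skipped = [], 0, 0
    for chunk in ordered_chunks:
        cost = count_tokens(chunk["text"])
        if used + cost <= token_budget:
            packed.append(chunk)
            used += cost
        else:
            skipped += 1  # keep scanning: a later, shorter chunk may still fit
    return packed, skipped


ordered = [
    {"id": "a", "text": "Redis stores hot data"},
    {"id": "b", "text": "Kafka is a distributed commit log"},
    {"id": "c", "text": "Use TTLs"},
]
packed, skipped = pack_to_budget(ordered, token_budget=6)
print([c["id"] for c in packed], skipped)
```

Chunks are never truncated to fit, so the output stays made of whole passages; the skipped count corresponds to the idea behind `budget_skipped_count` in the response stats.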

Why no LLM?

Approach	Deterministic	Typical latency	Cost
LLM compression / rerank	No	~500 ms+	Per-token API
SlimContext	Yes	~2 ms (precomputed embeddings) to ~600 ms (server-side embed)	Compute only

Algorithms only: cosine distance, TF-IDF, agglomerative clustering, MMR. Auditable, testable, safe to run on every request.

Quick start

cd SlimContext
python -m venv .venv
# Windows PowerShell:
.\.venv\Scripts\Activate.ps1
# macOS / Linux (bash, zsh): use this instead
# source .venv/bin/activate
pip install -r requirements.txt
uvicorn app.api:app --reload

Health check: GET http://127.0.0.1:8000/health

API

`POST /v1/optimize`

Request (only `chunks` is required; the other fields are shown with their defaults or example values):

{
  "query": "How does Redis improve performance in distributed systems?",
  "query_embedding": [0.81, 0.64, 0.72, 0.90],
  "namespace": "docs",
  "chunks": [
    {
      "id": "chunk-1",
      "text": "Redis stores hot data in memory...",
      "embedding": [0.80, 0.63, 0.71, 0.89],
      "score": 0.95
    }
  ],
  "target_k": 8,
  "token_budget": 1800,
  "semantic_dedup_threshold": 0.001,
  "dedup_threshold": 0.15,
  "cluster_linkage": "average",
  "enable_mmr": true,
  "mmr_lambda": 0.5,
  "representative_strategy": "auto",
  "compress": true
}

Response:

{
  "chunks": [{ "id": "chunk-1", "text": "...", "score": 0.95, "metadata": {} }],
  "stats": {
    "input_count": 21,
    "output_count": 7,
    "exact_duplicate_count": 1,
    "semantic_duplicate_count": 4,
    "cluster_count": 7,
    "input_tokens": 573,
    "output_tokens": 210,
    "reduction_pct": 63.35,
    "latency_ms": 12,
    "budget_skipped_count": 0
  }
}
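A hedged client sketch using only the Python standard library. The endpoint path and field names come from the request above; the helper names `build_request` and `reduction_pct` are hypothetical, and the reduction formula is an inference that happens to match the sample stats (573 input tokens, 210 output tokens, 63.35%).

```python
import json
import urllib.request


def build_request(chunks, query, url="http://127.0.0.1:8000/v1/optimize", **options):
    """Build a POST request for /v1/optimize; options become top-level JSON fields."""
    payload = {"query": query, "chunks": chunks, **options}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reduction_pct(stats):
    """Recompute the token reduction from input_tokens and output_tokens."""
    saved = stats["input_tokens"] - stats["output_tokens"]
    return round(100.0 * saved / stats["input_tokens"], 2)


req = build_request(
    [{"id": "chunk-1", "text": "Redis stores hot data in memory...", "score": 0.95}],
    query="How does Redis improve performance in distributed systems?",
    target_k=8,
    token_budget=1800,
)
# Sending the request needs a running server:
#   with urllib.request.urlopen(req) as resp:
#       stats = json.load(resp)["stats"]
stats = {"input_tokens": 573, "output_tokens": 210}  # values from the sample response
print(reduction_pct(stats))
```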

Parameters

Field	Default	Role
`chunks`	required	Retrieved passages (`id`, `text`, optional `embedding`, `score`, `metadata`).
`namespace`	`"default"`	Isolates exact-dedup hashes across tenants / indexes.
`query`	`""`	Used to embed a query vector when `query_embedding` is omitted.
`query_embedding`	optional	Query vector for MMR relevance; preferred when the retriever already has it.
`target_k`	`8`	Max chunks after MMR.
`token_budget`	optional	Max tokens after MMR; `null` = no cap.
`semantic_dedup_threshold`	`0.001`	Tight cosine distance for paraphrase detection (see tuning).
`dedup_threshold`	`0.15`	Cosine distance threshold for agglomerative clustering.
`cluster_threshold`	optional	Overrides `dedup_threshold` for clustering only.
`cluster_linkage`	`average`	Agglomerative linkage: `single`, `complete`, or `average`.
`enable_mmr`	`true`	Apply MMR when candidates exceed `target_k`; if `false`, take top-k by `score`.
`mmr_lambda`	`0.5`	`1.0` = relevance only, `0.0` = diversity only (only when `enable_mmr=true`).
`representative_strategy`	`auto`	`auto` | `score` | `centroid` | `query_closest` | `longest`
`max_per_cluster`	`1`	If `>1`, send up to N chunks per cluster into MMR before final selection.
`compress`	`true`	Apply lightweight compression to output text.
`embedding_model`	`BAAI/bge-small-en-v1.5`	Used only when any chunk is missing an embedding.

Representative strategies

After topical clustering, several chunks may still belong to the same group (e.g. three passages tagged caching that survived semantic dedup because wording differed enough). Representative selection picks one chunk per cluster to send forward to MMR. That keeps MMR focused on topics, not on picking among siblings in the same cluster.

Set via `representative_strategy` (default: `auto`).

Strategy	How the winner is chosen	Best for
`auto`	If any chunk in the cluster has `score > 0`, use `score`; otherwise use `centroid`.	Most RAG pipelines (vector DB or reranker already provides scores).
`score`	Highest `score` in the cluster.	Hybrid / BM25 / vector search where `score` reflects retriever confidence.
`centroid`	Chunk whose embedding is closest to the cluster’s average embedding.	Tool output, logs, or scraped text with no retrieval score — picks the most “typical” passage, not the longest or highest arbitrary score.
`query_closest`	Chunk whose embedding has highest cosine similarity to `query_embedding`. Requires `query_embedding` (or `query` so the server can embed it).	Q&A when the best evidence is “closest to what the user asked,” not highest retriever score (e.g. a lower-ranked chunk that directly answers the question).
`longest`	Chunk with the most characters in `text`.	Summarization or context packing when length proxies for detail (use carefully — long ≠ relevant).

Example. One cluster contains:

redis_core_1 — score 0.96, defines in-memory caching
redis_cache_paraphrase — score 0.94, shorter paraphrase

With `score` or `auto`, `redis_core_1` becomes the representative. With `query_closest`, the winner depends on which embedding aligns better with the query vector.

Interaction with `max_per_cluster`. With the default `max_per_cluster=1`, only representatives go to MMR. If `max_per_cluster > 1`, up to N highest-scored chunks per cluster are passed to MMR instead of a single representative, which is useful when a cluster is broad and you want MMR to trim within it.
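The strategies in the table above can be sketched as a single dispatch function. This is an illustration, not the code in `clustering.py`: the function names are hypothetical, "closest to the centroid" is measured here by cosine similarity (an assumption), and the example embeddings and texts are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid_of(cluster):
    """Component-wise mean of the cluster's embeddings."""
    dim = len(cluster[0]["embedding"])
    return [sum(c["embedding"][i] for c in cluster) / len(cluster) for i in range(dim)]


def pick_representative(cluster, strategy="auto", query_embedding=None):
    if strategy == "auto":
        # Fall back to centroid only when no chunk carries a positive score.
        has_score = any(c.get("score", 0) > 0 for c in cluster)
        strategy = "score" if has_score else "centroid"
    if strategy == "score":
        return max(cluster, key=lambda c: c.get("score", 0))
    if strategy == "centroid":
        center = centroid_of(cluster)
        return max(cluster, key=lambda c: cosine(c["embedding"], center))
    if strategy == "query_closest":
        if query_embedding is None:
            raise ValueError("query_closest needs a query embedding")
        return max(cluster, key=lambda c: cosine(c["embedding"], query_embedding))
    if strategy == "longest":
        return max(cluster, key=lambda c: len(c["text"]))
    raise ValueError(f"unknown strategy: {strategy}")


cluster = [
    {"id": "redis_core_1", "score": 0.96, "embedding": [0.9, 0.1],
     "text": "Redis keeps frequently read data in memory to avoid disk reads."},
    {"id": "redis_cache_paraphrase", "score": 0.94, "embedding": [0.7, 0.3],
     "text": "Redis is an in-memory cache."},
]
print(pick_representative(cluster)["id"])
print(pick_representative(cluster, "query_closest", query_embedding=[0.6, 0.4])["id"])
```

With these made-up vectors the query sits closer to the paraphrase, so `query_closest` picks the lower-scored chunk while `auto` and `score` keep `redis_core_1`.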

Tuning

Use two thresholds — they answer different questions:

Parameter	Question it answers	Typical range
`semantic_dedup_threshold`	“Are these the same information?”	`0.0005` – `0.02` (keep tight)
`dedup_threshold`	“Which chunks belong to the same topic?”	`0.10` – `0.35` (looser)
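To make the two questions concrete, the sketch below compares a single cosine distance against both thresholds. It is a simplification: real semantic dedup also applies a lexical check, and cluster membership under `average` linkage depends on average distances rather than one pair. The function names and the three labels are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def relate(a, b, semantic_dedup_threshold=0.001, dedup_threshold=0.15):
    """Classify a pair of embeddings by which threshold its distance falls under."""
    d = cosine_distance(a, b)
    if d <= semantic_dedup_threshold:
        return "paraphrase candidate"  # still needs the lexical check
    if d <= dedup_threshold:
        return "same topic"
    return "different topic"


print(relate([1.0, 0.0], [1.0, 0.01]))
print(relate([1.0, 0.0], [1.0, 0.3]))
print(relate([1.0, 0.0], [0.0, 1.0]))
```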

MMR:

mmr_score = λ × relevance − (1 − λ) × diversity_penalty

Relevance: cosine(query, chunk) if `query_embedding` is set, else the normalized retrieval `score`.
Diversity penalty: max(embedding similarity, lexical similarity) against already-selected chunks.
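A minimal greedy MMR loop following the formula above. It is a sketch under assumptions: relevance is always cosine similarity to the query, the diversity penalty uses embedding similarity only (the lexical term is left out for brevity), and `mmr_select` is a hypothetical name rather than the function in `mmr.py`.

```python
import math


def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(candidates, query_embedding, target_k=8, mmr_lambda=0.5):
    """Greedily pick up to target_k chunks, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < target_k:

        def mmr_score(c):
            relevance = cosine(query_embedding, c["embedding"])
            # Penalty is the similarity to the closest already-selected chunk.
            penalty = max(
                (cosine(c["embedding"], s["embedding"]) for s in selected),
                default=0.0,
            )
            return mmr_lambda * relevance - (1 - mmr_lambda) * penalty

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.98, 0.2]},
    {"id": "c", "embedding": [0.6, 0.8]},
]
query = [1.0, 0.3]
print([c["id"] for c in mmr_select(candidates, query, target_k=2, mmr_lambda=0.5)])
print([c["id"] for c in mmr_select(candidates, query, target_k=2, mmr_lambda=1.0)])
```

In this toy example `a` and `b` are near-duplicates, so with `mmr_lambda=0.5` the second slot goes to the more distinct `c`, while `mmr_lambda=1.0` ignores redundancy and takes the two most relevant chunks.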

Practical defaults for prose RAG: `semantic_dedup_threshold=0.001`, `dedup_threshold=0.15`, `mmr_lambda` 0.5–0.75, `target_k` 6–10.

Note: Low-dimensional or hand-made test embeddings often sit in a tight cone; SlimContext compensates with text/intent signals. Use real embeddings (e.g. bge-small) for production tuning.

Benchmarks

The latencies below come from the offline benchmark script, which does not use caller-supplied embeddings; it builds deterministic 64-dim hash vectors per chunk (not BGE). On `POST /v1/optimize` with embeddings already attached to every chunk, the same pipeline typically runs in under ~10 ms per request, often faster, because no embedding model runs.

Pipeline-only evaluation on benchmarks/data/dirty_test_set.json (100 RAG-style cases, 5 noisy chunks each, including one exact duplicate and one truth chunk). No LLM calls: the metrics are word counts, chunk counts, whether the truth chunk survived, and wall-clock time.

From the project root (SlimContext/):

# Windows PowerShell
$env:PYTHONPATH = (Get-Location)
python benchmarks/run_dirty_eval.py --target-k 1 --dedup-threshold 0.15 --mmr-lambda 0.8

# macOS / Linux
PYTHONPATH=. python benchmarks/run_dirty_eval.py --target-k 1 --dedup-threshold 0.15 --mmr-lambda 0.8

Outputs: benchmarks/results/manual_eval/summary.json, summary.csv, spot_check_case_1.md.

Results (measured May 2026)

Config: --target-k 1 --dedup-threshold 0.15 --semantic-dedup-threshold 0.001 (default) --mmr-lambda 0.8 --token-budget 1500

Metric	Value
Cases evaluated	100
Total words (raw retrieved context)	26,685
Total words (after SlimContext)	7,102
Word reduction	73.39%
Truth chunk retained (`id: truth`)	48% of cases
Total runtime	2,594.67 ms
Avg latency per case	25.95 ms

Same dataset with more output slots: --target-k 8 --mmr-lambda 0.75

Metric	Value
Total words (after SlimContext)	10,149
Word reduction	61.97%
Truth chunk retained	48% of cases
Avg latency per case	9.53 ms

Example — case 1 (Massachusetts compulsory education query):

Metric	Before	After
Chunks	5	1
Words	206	59
Word reduction	—	71.36%
Exact duplicates removed	—	1
Clusters formed	—	4
Truth retained	—	no (`exact_duplicate` was kept instead of `truth`; the two have identical text and `exact_duplicate` ranked higher in pipeline order)

Interpretation: high reduction is expected with `target_k=1`. Truth retention depends on scores, clustering, and MMR, so treat it as a pipeline sanity check and not as an answer-quality benchmark. For production, use real retriever embeddings and tune `target_k`, `mmr_lambda`, and the thresholds on your own data.

Project layout

app/
  api.py                  # FastAPI — POST /v1/optimize
  core/
    dedupe.py             # Exact hash dedup
    semantic_dedup.py     # Paraphrase + intent collapse
    clustering.py         # Topical clusters + representatives
    mmr.py                # MMR + token budget packing
    compression.py        # Deterministic text pruning
    text_similarity.py    # Lexical / TF-IDF helpers
    vectors.py            # Embedding utilities
benchmarks/
  run_dirty_eval.py       # Pipeline metrics without an LLM
  data/dirty_test_set.json
tests/

Credits

Pipeline concepts (over-fetch, semantic dedup, clustering, MMR) draw on ideas explored in Distill and the Agentic Engineering Guide — context engineering stack. SlimContext is an independent project with its own codebase and API.

License

MIT — see LICENSE in this repository.
