When an LLM retrieves context, a result that looks right but is wrong doesn't just waste a slot; it actively corrupts reasoning. Traditional metrics can't see this. Hive can.
The red card shares keywords with the query ("authentication") but answers a completely different question (password resets, not API tokens). When the LLM reads it, it contaminates the answer. This is Mutually Assured Distraction (MAD).
The agent writes configs, measures results, and iterates, like a developer running tests after each code change.
Agent writes a JSON config controlling search behavior
Hive runs the search and returns ranked results with scores
Hive scores results against known-good answers using UDCG
Agent reads scores and identifies what went wrong
Agent writes a new config and the loop repeats
Hive is built on a small set of composable primitives. Each one solves a specific problem in retrieval quality, and together they form a system where an AI agent can self-improve without human intervention. These aren't abstract ideas; each primitive maps to concrete code and measurable behavior.
The agent's interface is a directory of JSON files: no custom protocols, no APIs to learn.
This is the key Hornet insight: align retrieval configuration with how coding agents already work. An agent that can edit tsconfig.json can edit a Hive config. No new abstractions. The entire workspace is files on disk: versioned, diffable, reviewable.
Why this matters: LLM agents are fundamentally code-writing tools. They understand files, directories, and JSON. By making retrieval config a file, we turn search optimization into a coding task, which is exactly what these agents are good at.
A config file is the "program" that tells Hive how to search. It's the thing the agent iterates on.
{
  "name": "v1-naive",
  "collection": "knowledge-base",
  "retrieval": {
    "method": "hybrid",            // keyword | vector | hybrid
    "top_k": 10,
    "rrf_k": 60                    // RRF constant (see Primitive 6)
  },
  "dynamic_k": {
    "enabled": false,
    "gap_threshold_factor": 3.0,   // cliff detection sensitivity
    "min_results": 1,
    "max_results": 10
  },
  "filters": {},                   // category pre-filtering
  "distraction_detection": {
    "enabled": false,
    "disagreement_threshold": 0.5
  }
}
Key design choices in this config:
The agent discovers what data looks like by reading the schema, then uses that knowledge to build better configs.
Fields marked filterable: true can be used in config filters. Filters are applied as a WHERE clause before scoring (pre-filter), not after. This means distractors in excluded categories never even enter the ranking pipeline.
{
  "name": "knowledge-base",
  "fields": {
    "title": { "type": "text" },
    "category": { "type": "keyword", "filterable": true },
    "content": { "type": "text" }
  },
  "chunking": {
    "strategy": "by_heading",
    "max_tokens": 512,
    "heading_level": 2
  }
}
The agent can read this file to discover: "category is filterable, and the available values are api-docs, tutorials, faqs, changelogs." Then it can write a config with "filters": {"category": ["api-docs", "tutorials"]} to exclude FAQ distractors entirely.
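As a sketch of how a config filter becomes a pre-filter, the `filters` object can be translated into a SQL WHERE clause that runs before any scoring. The function and column names here are illustrative assumptions, not Hive's actual code:

```python
def build_prefilter(filters):
    """Translate a config's filters object into a parameterized SQL
    WHERE clause. Field names come from the trusted collection schema,
    so interpolating them directly is acceptable in this sketch."""
    clauses, params = [], []
    for field, values in filters.items():
        placeholders = ",".join("?" for _ in values)
        clauses.append(f"{field} IN ({placeholders})")
        params.extend(values)
    # With no filters, match everything.
    where = " AND ".join(clauses) if clauses else "1=1"
    return where, params
```

Because the clause is applied in the initial SELECT over chunks, excluded categories never reach BM25 or vector scoring at all.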
Every config change goes through syntactic, semantic, and behavioral verification, like compiling, linting, and running tests.
This is the Verifiable APIs principle from Hornet: the system's interface should be checkable at multiple levels before anything gets deployed. The agent gets immediate, actionable feedback: not vague errors, but specific "here's what's wrong and how to fix it" messages.
"Does the config parse correctly?"
"Do the settings make logical sense together?"
"Does it actually perform better than what's deployed?"
Run `hive compare` to see what changed.
The behavioral gate is the critical innovation: even if a config passes syntactic and semantic checks, it must prove it's better by running against the golden evaluation set. This prevents an agent from "improving" a config that actually makes retrieval worse.
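The behavioral gate's core check is small. A minimal sketch, assuming nUDCG on the golden set is the comparison metric (the function and parameter names are hypothetical):

```python
def behavioral_gate(candidate_nudcg, deployed_nudcg, min_delta=0.0):
    """Allow deployment only if the candidate config scores at least as
    well as the currently deployed config on the golden evaluation set.
    min_delta can require a strict improvement margin."""
    return candidate_nudcg >= deployed_nudcg + min_delta
```

The point is that the gate compares measured behavior, not config contents: a syntactically perfect config that degrades retrieval is still rejected.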
Instead of always returning top_k results, stop when there's a quality cliff: "retrieve less, retrieve better."
Traditional retrieval always returns a fixed number of results (e.g., 10). But what if only 5 are good and the rest are noise? Dynamic-k detects the "quality cliff" where results transition from strong to weak, and stops there.
Why not score-ratio cutoff? RRF scores are compressed into a narrow band (all values between ~0.018 and ~0.033 for k=60). A result at BM25 rank 20 / vector rank 30 still has a ratio of ~0.72 to the best result. Score-ratio cutoffs would almost never trigger with RRF. Instead, we measure relative gaps.
The algorithm: After ranking by RRF, compute the gap between each consecutive pair. Track a running mean of gaps seen so far. If gap_i > gap_threshold_factor × mean_gap_so_far, stop: a cliff has been detected.
| Rank | RRF Score | Gap | Running Mean | Gap / Mean |
|---|---|---|---|---|
| 1 | 0.0328 | - | - | - |
| 2 | 0.0318 | 0.0010 | 0.0010 | 1.0x ✓ |
| 3 | 0.0307 | 0.0011 | 0.00105 | 1.0x ✓ |
| 4 | 0.0295 | 0.0012 | 0.0011 | 1.1x ✓ |
| 5 | 0.0281 | 0.0014 | 0.00118 | 1.2x ✓ |
| 6 | 0.0266 | 0.0015 | 0.00124 | 1.2x ✓ |
| 7 | 0.0198 | 0.0068 | 0.00124 | 5.5x STOP |
Returns 6 results instead of 10. Ranks 1–6 have gradually increasing but consistent gaps (~0.001), then rank 7 drops sharply: the cliff. This is interpretable: "stop when the next result is dramatically worse than the trend so far." The approach works with compressed RRF scores because it measures relative gaps, not absolute score levels.
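The gap-based cutoff described above can be sketched as follows. This is a minimal illustration; Hive's actual implementation may differ in details such as how the running mean is seeded:

```python
def dynamic_k(scores, gap_threshold_factor=3.0, min_results=1, max_results=10):
    """Return how many ranked results to keep, stopping at a quality cliff.

    scores: RRF scores in descending rank order.
    """
    scores = scores[:max_results]
    if len(scores) <= min_results:
        return len(scores)
    gaps = []
    for i in range(1, len(scores)):
        gap = scores[i - 1] - scores[i]
        if gaps:
            mean_gap = sum(gaps) / len(gaps)
            # Cliff: this gap is much larger than the trend so far.
            if gap > gap_threshold_factor * mean_gap and i >= min_results:
                return i
        gaps.append(gap)
    return len(scores)
```

Running this on the scores from the table returns 6: the 0.0068 gap at rank 7 is 5.5x the running mean, well past the 3.0 threshold.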
Combine keyword search and semantic search using ranks, not raw scores; no normalization needed.
The naive approach to hybrid search is to normalize BM25 scores (range 0–15) and cosine similarity scores (range 0–1) to the same scale and take a weighted average. This is fragile: the normalization depends on the query, the corpus size, and the score distributions. RRF sidesteps all of this by operating on ranks instead of scores.
The fused score of a document d is RRF(d) = Σ_r 1/(k + rank_r(d)), summed over each rank list r being fused, where k is a constant (default 60, configurable as rrf_k). Three advantages:
For keyword-only mode, only BM25 ranks are used. For vector-only, only cosine ranks. For hybrid, both contribute to the final score.
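Rank-based fusion is compact to implement. A sketch, assuming each input is a list of doc ids in best-first rank order:

```python
def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Fuse two rank lists into one ranking via Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; documents absent
    from a list simply get no contribution from it.
    """
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Note that a document ranked first by both lists scores 2/61 ≈ 0.0328 with k=60, matching the top score in the dynamic-k table above.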
A result that ranks high in keywords but low in meaning (or vice versa) is suspicious; flag it.
Distractors share keywords with the query but mean something different. BM25 ranks them high (keyword overlap), but vector search ranks them low (different semantics). This disagreement between scoring methods is a detectable signal at query time, with no golden labels needed.
Doc B matches on keywords but means something different: a classic FAQ distractor. The flag appears in query output so the agent can see it, but it does NOT auto-filter the result. The agent decides what to do. At evaluation time (not query time), golden labels provide exact distractor counts as ground truth.
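One way to compute the disagreement signal is as a normalized rank distance between the two rankers. This sketch illustrates the idea; the exact formula behind `disagreement_threshold` in the config is an assumption here:

```python
def flag_disagreement(bm25_rank, vector_rank, n_results,
                      disagreement_threshold=0.5):
    """Flag a result whose keyword and semantic ranks disagree strongly.

    Ranks are 1-based. Disagreement is the rank distance normalized to
    [0, 1] by the worst possible distance within the result set.
    """
    disagreement = abs(bm25_rank - vector_rank) / max(n_results - 1, 1)
    return disagreement > disagreement_threshold
```

A doc at BM25 rank 1 but vector rank 9 (out of 10) gets disagreement ≈ 0.89 and is flagged; ranks 2 and 3 agree closely and pass.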
The metric that makes distractors visible: traditional nDCG treats wrong answers as harmless zeros, while UDCG assigns them -1.
Each result's contribution is its utility (+1 relevant, -1 distractor, 0 irrelevant) weighted by rank position with the standard DCG discount 1/log2(rank + 1), so a distractor at rank 1 costs more than one at rank 5. The result is normalized by the ideal score (all relevant docs ranked first) to get nUDCG in the range [-1, +1].
Worked example: a query with 3 relevant (R), 1 distractor (D), 1 irrelevant (I):
| Rank | Doc | Utility | Discount | Contribution |
|---|---|---|---|---|
| 1 | R | +1.0 | 1.000 | +1.000 |
| 2 | D | -1.0 | 0.631 | -0.631 |
| 3 | R | +1.0 | 0.500 | +0.500 |
| 4 | I | 0.0 | 0.431 | 0.000 |
| 5 | R | +1.0 | 0.387 | +0.387 |
Key insight: The distractor at rank 2 cost 0.631 points. If we removed it and shifted results up, nUDCG would jump to ~0.85. This is what the agent discovers through the feedback loop: removing distractors is more valuable than finding more relevant documents.
Document-level dedup: Golden labels are per-document but retrieval returns per-chunk results. A document may produce multiple chunks. To avoid double-counting: for each document, only the highest-ranked chunk contributes to UDCG. All subsequent chunks from the same document are skipped (utility = 0). This prevents distractor penalties from being inflated by multi-chunk documents.
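Putting the utility weighting, the logarithmic discount, normalization, and document-level dedup together, a sketch (golden labels reduced to two sets of doc ids; function and parameter names are illustrative):

```python
import math

def nudcg(ranked, relevant, distractors):
    """Normalized Utility-Discounted Cumulative Gain.

    ranked: list of (doc_id, chunk_id) results, best first.
    relevant / distractors: sets of doc ids from golden labels.
    Utilities: +1 relevant, -1 distractor, 0 otherwise. Only the
    highest-ranked chunk per document counts (document-level dedup).
    """
    seen_docs = set()
    score = 0.0
    for rank, (doc_id, _chunk_id) in enumerate(ranked, start=1):
        if doc_id in seen_docs:
            continue  # later chunks of the same doc contribute nothing
        seen_docs.add(doc_id)
        utility = 1.0 if doc_id in relevant else (
            -1.0 if doc_id in distractors else 0.0)
        score += utility / math.log2(rank + 1)
    # Ideal score: all relevant docs ranked first.
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, len(relevant) + 1))
    return score / ideal if ideal else 0.0
```

On the worked example above (R, D, R, I, R), the discounted contributions sum to 1.256 against an ideal of 2.131, giving nUDCG ≈ 0.59.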
The foundational insight: a wrong result doesn't just waste a slot; it actively corrupts LLM reasoning.
Traditional search evaluation treats results as either relevant (1) or irrelevant (0). This made sense when humans scanned a list: an irrelevant result was just skipped. But when an LLM reads search results as context, a distractor (a result that looks right but is wrong) actively corrupts the answer. The LLM can't distinguish between "API authentication via bearer tokens" and "authentication FAQ about password resets"; both mention authentication, but one answers the wrong question.
This is the MAD insight from Hornet: in LLM-powered retrieval, irrelevant is not the worst outcome; misleading is. An FAQ document about password resets doesn't just fail to help with an API auth question; it pulls the LLM's response toward password reset instructions, contaminating the final answer.
Every primitive in Hive exists to address this: UDCG makes the cost visible (-1 per distractor), score disagreement detects them at query time, dynamic-k prevents over-retrieving them, category filters exclude them structurally, and deploy gates prevent configs that increase them. The entire system is oriented around one question: "are we injecting harmful context into the LLM?"
No Elasticsearch. No Pinecone. No Docker. BM25 from scratch, vector search with numpy, all in SQLite.
Hive implements the entire search engine from first principles: BM25 scoring with an inverted index (~100 lines), vector search via cosine similarity on numpy arrays, and RRF fusion, all stored in a single SQLite database. For the prototype's corpus of ~60 chunks, this runs in milliseconds with no external services.
BM25 uses an inverted index in postings, per-term doc frequencies in term_stats, and corpus stats for length normalization. O(matching chunks) per query term.
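For reference, a self-contained BM25 scorer over tokenized documents looks roughly like this. It is a sketch: the real engine reads these statistics from the SQLite tables rather than recomputing them, and the parameter defaults k1=1.5, b=0.75 are common choices, not confirmed values:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per term (the term_stats table in Hive).
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequencies within this document
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the corpus
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A document sharing both query terms outranks one sharing only a common term, and a document sharing none scores exactly zero.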
Vectors are stored as numpy arrays serialized with tobytes() in SQLite BLOBs. Cosine similarity via np.dot. For 50–150 chunks, this is instant (<5ms). No ANN index needed.
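The BLOB round-trip and the similarity computation are a few lines of numpy. A sketch; the helper names are illustrative:

```python
import numpy as np

def to_blob(vec):
    """Serialize a float32 vector for a SQLite BLOB column."""
    return np.asarray(vec, dtype=np.float32).tobytes()

def from_blob(blob):
    """Restore a float32 vector from a SQLite BLOB."""
    return np.frombuffer(blob, dtype=np.float32)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

At corpus sizes of a few hundred chunks, brute-force cosine over all stored vectors is cheaper than maintaining any approximate index.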
Embeddings are cached in .npz files. The demo runs with cached embeddings, so no OpenAI API calls are made during the recording. A BM25-only mode is available when no API key is set.
Every config change goes through the same verification pipeline, like compiling, linting, and testing code.
Does the config parse correctly?
Do the settings make logical sense together?
Does it actually perform better?
An AI agent starts with a naive config and optimizes it in 3 steps, with zero human intervention.
Hybrid search, top-k=10, no filtering, no dynamic-k
Agent excludes FAQ category, enables distraction detection
Agent enables gap-based cutoff: fewer results, higher quality
Result: nUDCG improved 114% (0.35 → 0.75). Distractors eliminated (4 → 0). Zero human intervention.
Everything runs locally. SQLite for storage, Python for the engine, JSON files for configs.
flowchart TD
subgraph workspace [Workspace: Files on Disk]
Collections["collections/*.json"]
Configs["configs/*.json"]
ActiveConfig["configs/active.json"]
Documents["documents/*.md"]
Evals["evals/golden.json"]
end
subgraph engine [Hive Engine]
Validator["Validator\n(syntactic + semantic + behavioral)"]
Indexer["Indexer\n(chunk + BM25 + embed)"]
Searcher["Searcher\n(RRF + dynamic-k + disagreement)"]
Evaluator["Evaluator\n(UDCG + comparison)"]
end
subgraph storage [SQLite: hive.db]
ChunksTable["chunks"]
PostingsTable["postings + term_stats"]
ConfigHistory["config_versions"]
EvalResults["eval_results"]
end
CLI["CLI: validate | index | query | evaluate | compare | deploy"]
CLI --> Validator
CLI --> Indexer
CLI --> Searcher
CLI --> Evaluator
Validator --> Configs
Indexer --> Documents
Searcher --> ChunksTable
Evaluator --> Evals
Quick reference for non-technical readers.
A search quality score that penalizes wrong answers, not just rewards right ones. Traditional nDCG treats distractors as harmless zeros.
A search result that looks right but means something different. Like finding "Password Reset FAQ" when you searched for "API Authentication."
Instead of always returning 10 results, stop when quality drops off a cliff. Fewer results = fewer chances for distractors to sneak in.
A way to combine keyword search and meaning search without worrying about incompatible score scales. Rank-based, not score-based.
The system refuses to deploy a config that makes search quality worse. Like a CI pipeline that blocks merging if tests fail.
The agent tries, measures, learns, and tries again. Like a developer running tests after each code change, but fully automated.