Your AI Agent Is Being Sabotaged
by Its Own Search Results

When an LLM retrieves context, a result that looks right but is wrong doesn't just waste a slot — it actively corrupts reasoning. Traditional metrics can't see this. Hive can.

#1 API Authentication: Bearer Tokens & API Keys RELEVANT
#2 Authentication FAQ: Password Reset & 2FA Setup DISTRACTOR
#3 API Authentication: Token Refresh RELEVANT
#4 Rate Limits: Handling 429 Errors IRRELEVANT
#5 SDK Authentication Configuration RELEVANT

The red card shares keywords with the query ("authentication") but answers a completely different question (password resets, not API tokens). When the LLM reads it, it contaminates the answer. This is Mutually Assured Distraction (MAD).

nDCG says 0.75 — "Looks fine"
nUDCG says 0.35 — "Reveals the problem"

The Feedback Loop

The agent writes configs, measures results, and iterates — like a developer running tests after each code change.

๐Ÿ“ 1. Config

Agent writes a JSON config controlling search behavior

The config specifies: method (keyword, vector, hybrid), how many results, what filters to apply, whether to enable dynamic-k and distraction detection. It's a file on disk โ€” versioned, diffable, reviewable.
๐Ÿ” 2. Search

Hive runs the search with ranked results and scores

Combines BM25 (keyword) and vector (semantic) search via Reciprocal Rank Fusion. Optionally flags results where the two methods disagree โ€” a signal of potential distractors.
๐Ÿ“Š 3. Evaluate

Scores results against known-good answers using UDCG

UDCG assigns +1 to relevant results, -1 to distractors, and 0 to irrelevant ones. Each is weighted by rank position โ€” top ranks matter most. The resulting score reveals problems that nDCG hides.
๐Ÿง  4. Reason

Agent reads scores and identifies what went wrong

The LLM sees: "nUDCG is 0.38, 2 distractors from FAQs." It reasons: "FAQ documents share keywords with API docs but answer different questions. I should filter them out or enable distraction detection."
๐Ÿ”ง 5. Improve

Agent writes a new config and the loop repeats

The new config might enable category filtering, turn on dynamic-k, or adjust disagreement thresholds. It's validated before evaluation โ€” just like compiling code before running tests.
nUDCG Progress 0.35 โ†’ 0.58 โ†’ 0.75
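The five steps can be sketched as one generic loop. This is an illustrative sketch, not Hive's actual code: `evaluate` stands in for Hive's UDCG evaluator and `propose` for the LLM agent's config-writing step.

```python
def feedback_loop(evaluate, propose, initial_config, target_nudcg=0.75, max_iters=5):
    """Config -> search -> evaluate -> reason -> improve, repeated.

    evaluate(config) -> nUDCG score; propose(config, score) -> revised config.
    A candidate only replaces the current config if it scores better --
    the same never-regress rule the deploy gate enforces."""
    config, score = initial_config, evaluate(initial_config)
    history = [score]
    while len(history) <= max_iters and score < target_nudcg:
        candidate = propose(config, score)
        candidate_score = evaluate(candidate)
        if candidate_score > score:
            config, score = candidate, candidate_score
        history.append(score)
    return config, history

# With stand-ins that mimic the demo's three configs:
# evaluate = {"v1": 0.35, "v2": 0.58, "v3": 0.75}.get
# propose  = lambda cfg, s: {"v1": "v2", "v2": "v3"}[cfg]
# feedback_loop(evaluate, propose, "v1")  -> ("v3", [0.35, 0.58, 0.75])
```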

First Principles & Primitives

Hive is built on a small set of composable primitives. Each one solves a specific problem in retrieval quality — and together they form a system where an AI agent can self-improve without human intervention. These aren't abstract ideas; each primitive maps to concrete code and measurable behavior.

1

Everything Is a File

The agent's interface is a directory of JSON files — no custom protocols, no APIs to learn.

This is the key Hornet insight: align retrieval configuration with how coding agents already work. An agent that can edit tsconfig.json can edit a Hive config. No new abstractions. The entire workspace is files on disk — versioned, diffable, reviewable.

workspace/
collections/knowledge-base.json — "what does the data look like?"
configs/v1.json — "how should I search?" (the code)
configs/v2.json — "improved version" (agent creates this)
configs/active.json — symlink to deployed config
documents/*.md — the actual content
evals/golden.json — known-good query-result pairs

Why this matters: LLM agents are fundamentally code-writing tools. They understand files, directories, and JSON. By making retrieval config a file, we turn search optimization into a coding task — which is exactly what these agents are good at.

2

The Retrieval Config IS the Source Code

A config file is the "program" that tells Hive how to search. It's the thing the agent iterates on.

{
  "name": "v1-naive",
  "collection": "knowledge-base",
  "retrieval": {
    "method": "hybrid",   // keyword | vector | hybrid
    "top_k": 10,
    "rrf_k": 60           // RRF constant (see Primitive 6)
  },
  "dynamic_k": {
    "enabled": false,
    "gap_threshold_factor": 3.0,  // cliff detection sensitivity
    "min_results": 1,
    "max_results": 10
  },
  "filters": {},           // category pre-filtering
  "distraction_detection": {
    "enabled": false,
    "disagreement_threshold": 0.5
  }
}

Key design choices in this config:

  • RRF instead of weighted scores — scale-invariant, no normalization needed
  • gap_threshold_factor — gap-based cliff detection that works with RRF's compressed score range
  • distraction_detection requires method: hybrid — needs both BM25 and vector ranks to compute disagreement
  • filters — pre-filter by category BEFORE scoring, not after (eliminates distractors before they enter the ranking)
3

Collection Schema Defines Filterable Fields

The agent discovers what data looks like by reading the schema โ€” then uses that knowledge to build better configs.

Fields marked filterable: true can be used in config filters. Filters are applied as a WHERE clause before scoring (pre-filter), not after. This means distractors in excluded categories never even enter the ranking pipeline.

{
  "name": "knowledge-base",
  "fields": {
    "title":    { "type": "text" },
    "category": { "type": "keyword", "filterable": true },
    "content":  { "type": "text" }
  },
  "chunking": {
    "strategy": "by_heading",
    "max_tokens": 512,
    "heading_level": 2
  }
}

The agent can read this file to discover that category is filterable; from the indexed documents it learns the values in use (api-docs, tutorials, faqs, changelogs). Then it can write a config with "filters": {"category": ["api-docs", "tutorials"]} to exclude FAQ distractors entirely.
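A rough sketch of how such a pre-filter behaves (the table and column names here are illustrative, not Hive's actual schema): the WHERE clause runs before any scoring, so excluded chunks never become candidates.

```python
import sqlite3

def prefilter_chunk_ids(db_path, allowed_categories):
    """Pre-filter sketch: restrict candidate chunks by category BEFORE scoring,
    so chunks in excluded categories never enter the ranking pipeline.
    Assumes an illustrative chunks(id, category) table."""
    placeholders = ", ".join("?" * len(allowed_categories))
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            f"SELECT id FROM chunks WHERE category IN ({placeholders})",
            list(allowed_categories),
        )
        return {row[0] for row in rows}
```

Only the surviving ids are handed to BM25 and vector scoring; a FAQ distractor filtered here can never appear in the ranking, flagged or otherwise.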

4

Verifiable APIs: Three-Level Validation as a Compiler

Every config change goes through syntactic, semantic, and behavioral verification — like compiling, linting, and running tests.

This is the Verifiable APIs principle from Hornet: the system's interface should be checkable at multiple levels before anything gets deployed. The agent gets immediate, actionable feedback — not vague errors, but specific "here's what's wrong and how to fix it" messages.

Level 1: Syntactic Like a compiler

"Does the config parse correctly?"

Is top_k a positive integer? Is method one of keyword|vector|hybrid? Are all required fields present?
Level 2: Semantic Like a linter

"Do the settings make logical sense together?"

✗ distraction_detection.enabled is true but method is "keyword".
  Distraction detection requires hybrid to compare keyword vs vector rankings.
  → Set method to "hybrid" or disable distraction detection.
Level 3: Behavioral (Deploy Gate) Like a test suite

"Does it actually perform better than what's deployed?"

Deploy BLOCKED: nUDCG regression 0.61 → 0.34. The system refuses to make things worse.
Run hive compare to see what changed.

The behavioral gate is the critical innovation: even if a config passes syntactic and semantic checks, it must prove it's better by running against the golden evaluation set. This prevents an agent from "improving" a config that actually makes retrieval worse.
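A minimal sketch of levels 2 and 3. The rule set and function names are illustrative; Hive's real validator would carry more checks:

```python
def semantic_check(config):
    """Level 2 sketch: do the settings make sense together?
    Returns actionable error messages rather than vague failures."""
    errors = []
    method = config.get("retrieval", {}).get("method")
    if config.get("distraction_detection", {}).get("enabled") and method != "hybrid":
        errors.append(
            f'distraction_detection.enabled is true but method is "{method}". '
            'Distraction detection requires hybrid to compare keyword vs vector '
            'rankings. -> Set method to "hybrid" or disable distraction detection.'
        )
    return errors

def deploy_gate(candidate_nudcg, active_nudcg):
    """Level 3 sketch: block any config that regresses the golden-set score."""
    return candidate_nudcg >= active_nudcg
```

Level 1 is just schema validation of the JSON; the point of splitting levels 2 and 3 out is that each failure message tells the agent exactly which field to change next.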

5

Dynamic-k: Gap-Based Cutoff

Instead of always returning top_k results, stop when there's a quality cliff — "retrieve less, retrieve better."

Traditional retrieval always returns a fixed number of results (e.g., 10). But what if only 5 are good and the rest are noise? Dynamic-k detects the "quality cliff" where results transition from strong to weak, and stops there.

Why not score-ratio cutoff? RRF scores are compressed into a narrow band (all values between ~0.018 and ~0.033 for k=60). A result at BM25 rank 20 / vector rank 30 still has a ratio of ~0.72 to the best result. Score-ratio cutoffs would almost never trigger with RRF. Instead, we measure relative gaps.

The algorithm: After ranking by RRF, compute the gap between each consecutive pair. Track a running mean of gaps seen so far. If gap_i > gap_threshold_factor × mean_gap_so_far, stop — cliff detected.

Rank  RRF Score  Gap     Running Mean  Gap / Mean
1     0.0328     —       —             —
2     0.0318     0.0010  0.0010        1.0x ✓
3     0.0307     0.0011  0.00105      1.0x ✓
4     0.0295     0.0012  0.0011        1.1x ✓
5     0.0281     0.0014  0.00118      1.2x ✓
6     0.0266     0.0015  0.00124      1.2x ✓
7     0.0198     0.0068  0.00124      5.5x STOP

Returns 6 results instead of 10. Ranks 1–6 have gradually increasing but consistent gaps (~0.001), then rank 7 drops sharply — the cliff. This is interpretable: "stop when the next result is dramatically worse than the trend so far." The approach works with compressed RRF scores because it measures relative gaps, not absolute score levels.
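The cutoff fits in a few lines. A sketch, with parameter names following the config above (`dynamic_k_cutoff` itself is an illustrative name):

```python
def dynamic_k_cutoff(scores, gap_threshold_factor=3.0, min_results=1, max_results=10):
    """scores: RRF scores in descending rank order. Returns how many results
    to keep: stop at the first gap much larger than the running mean of the
    gaps seen so far (the quality cliff)."""
    scores = scores[:max_results]
    gaps = []
    for i in range(1, len(scores)):
        gap = scores[i - 1] - scores[i]
        if gaps and i >= min_results:
            mean_gap = sum(gaps) / len(gaps)
            if mean_gap > 0 and gap > gap_threshold_factor * mean_gap:
                return i  # cliff detected: keep ranks 1..i
        gaps.append(gap)
    return len(scores)

# The table above: consistent ~0.001 gaps, then a 0.0068 drop at rank 7,
# so the function keeps 6 results.
```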

6

Reciprocal Rank Fusion (RRF) for Hybrid Scoring

Combine keyword search and semantic search using ranks, not raw scores — no normalization needed.

The naive approach to hybrid search is to normalize BM25 scores (range 0–15) and cosine similarity scores (range 0–1) to the same scale and take a weighted average. This is fragile — the normalization depends on the query, the corpus size, and the score distributions. RRF sidesteps all of this by operating on ranks instead of scores.

rrf_score(doc) = 1/(k + rank_bm25) + 1/(k + rank_vector)

Where k is a constant (default 60, configurable as rrf_k). Three advantages:

  • No normalization needed — operates on ranks, not raw scores
  • Scale-invariant — BM25 producing scores 0–15 and cosine producing 0–1 doesn't matter
  • Well-studied — used by Elasticsearch, Vespa, and many production RAG systems

For keyword-only mode, only BM25 ranks are used. For vector-only, only cosine ranks. For hybrid, both contribute to the final score.
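In code, the whole fusion step is small. A sketch (function name illustrative):

```python
def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """bm25_ranking / vector_ranking: doc ids, best first. Each ranking a doc
    appears in contributes 1/(k + rank); a doc missing from one ranking simply
    gets nothing from that side. Returns doc ids sorted by fused score."""
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note the interaction with distractors: a doc ranked 1 by BM25 but 12 by vector gets 1/61 + 1/72 ≈ 0.0303, while a doc ranked 3 by both gets 2/63 ≈ 0.0317 and wins. Consistent agreement beats a one-sided keyword match.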

7

Distraction Detection via Score Disagreement

A result that ranks high in keywords but low in meaning (or vice versa) is suspicious — flag it.

Distractors share keywords with the query but mean something different. BM25 ranks them high (keyword overlap), but vector search ranks them low (different semantics). This disagreement between scoring methods is a detectable signal at query time — no golden labels needed.

disagreement = |rank_bm25 - rank_vector| / max(rank_bm25, rank_vector)
Doc A: Consistent ranking
BM25 rank 2, vector rank 3
disagreement = 1/3 = 0.33 (OK)
Doc B: Classic keyword trap
BM25 rank 1, vector rank 12
disagreement = 11/12 = 0.92 (FLAGGED)

Doc B matches on keywords but means something different — a classic FAQ distractor. The flag appears in query output so the agent can see it, but it does NOT auto-filter the result. The agent decides what to do. At evaluation time (not query time), golden labels provide exact distractor counts as ground truth.
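The check above, sketched in a few lines (function names illustrative):

```python
def disagreement(rank_bm25, rank_vector):
    """Normalized rank disagreement between the two methods, in [0, 1)."""
    return abs(rank_bm25 - rank_vector) / max(rank_bm25, rank_vector)

def flag_suspects(results, threshold=0.5):
    """results: (doc_id, rank_bm25, rank_vector) triples. Flags potential
    distractors but never filters them -- the agent decides what to do."""
    return [(doc, disagreement(rb, rv) > threshold) for doc, rb, rv in results]
```

With the two docs above: Doc A scores 1/3 ≈ 0.33 and passes; Doc B scores 11/12 ≈ 0.92 and gets flagged.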

8

UDCG: Utility-Discounted Cumulative Gain

The metric that makes distractors visible — traditional nDCG treats wrong answers as harmless zeros. UDCG assigns them -1.

UDCG@k = Σ_{i=1}^{k} utility(doc_i) / log2(i + 1)
+1
Relevant
0
Irrelevant
-1
Distractor

Each result's contribution is weighted by its rank position — a distractor at rank 1 costs more than one at rank 5. The result is normalized by the ideal score (all relevant docs ranked first) to get nUDCG in the range [-1, +1].

Worked example — query with 3 relevant (R), 1 distractor (D), 1 irrelevant (I):

Rank  Doc  Utility  Discount  Contribution
1     R    +1.0     1.000     +1.000
2     D    -1.0     0.631     -0.631
3     R    +1.0     0.500     +0.500
4     I     0.0     0.431      0.000
5     R    +1.0     0.387     +0.387

UDCG@5: 1.256
Ideal (3R at top): 2.131
nUDCG@5: 0.59

Key insight: The distractor at rank 2 cost 0.631 points. If it were merely irrelevant (utility 0) at the same rank, nUDCG would rise to ~0.89; remove it entirely and shift the remaining results up, and nUDCG reaches ~0.97. This is what the agent discovers through the feedback loop — removing distractors is more valuable than finding more relevant documents.

Document-level dedup: Golden labels are per-document but retrieval returns per-chunk results. A document may produce multiple chunks. To avoid double-counting: for each document, only the highest-ranked chunk contributes to UDCG. All subsequent chunks from the same document are skipped (utility = 0). This prevents distractor penalties from being inflated by multi-chunk documents.
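The scoring in the worked example fits in a few lines. A sketch (function name illustrative; the per-document dedup described above is assumed to have already zeroed out duplicate chunks before utilities are assigned):

```python
import math

def nudcg(utilities, num_relevant):
    """utilities: per-rank utility after dedup (+1 relevant, -1 distractor,
    0 irrelevant). Discount is 1/log2(rank + 1); the sum is normalized by the
    ideal ordering, i.e. all relevant docs ranked first."""
    udcg = sum(u / math.log2(rank + 1) for rank, u in enumerate(utilities, start=1))
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(num_relevant, len(utilities)) + 1))
    return udcg / ideal

# The worked example above: R, D, R, I, R with 3 relevant docs total
# round(nudcg([1, -1, 1, 0, 1], num_relevant=3), 2)  -> 0.59
```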

!

Mutually Assured Distraction (MAD)

The foundational insight: a wrong result doesn't just waste a slot — it actively corrupts LLM reasoning.

Traditional search evaluation treats results as either relevant (1) or irrelevant (0). This made sense when humans scanned a list — an irrelevant result was just skipped. But when an LLM reads search results as context, a distractor (a result that looks right but is wrong) actively corrupts the answer. The LLM can't distinguish between "API authentication via bearer tokens" and "authentication FAQ about password resets" — both mention authentication, but one answers the wrong question.

This is the MAD insight from Hornet: in LLM-powered retrieval, irrelevant is not the worst outcome — misleading is. An FAQ document about password resets doesn't just fail to help with an API auth question; it pulls the LLM's response toward password reset instructions, contaminating the final answer.

Every primitive in Hive exists to address this: UDCG makes the cost visible (-1 per distractor), score disagreement detects them at query time, dynamic-k prevents over-retrieving them, category filters exclude them structurally, and deploy gates prevent configs that increase them. The entire system is oriented around one question: "are we injecting harmful context into the LLM?"

0

Zero External Infrastructure

No Elasticsearch. No Pinecone. No Docker. BM25 from scratch, vector search with numpy, all in SQLite.

Hive implements the entire search engine from first principles: BM25 scoring with an inverted index (~100 lines), vector search via cosine similarity on numpy arrays, and RRF fusion — all stored in a single SQLite database. For the prototype's corpus of ~60 chunks, this runs in milliseconds with no external services.

BM25: Inverted index in SQLite. Per-chunk term frequencies in postings, per-term doc frequencies in term_stats, corpus stats for length normalization. O(matching chunks) per query term.
Vector search: Embeddings stored as numpy arrays serialized via tobytes() in SQLite BLOBs. Cosine similarity via np.dot. For 50–150 chunks, this is instant (<5ms). No ANN index needed.
Pre-filtering: Category filters applied as a WHERE clause on metadata before scoring. Distractors in excluded categories never enter the ranking pipeline.
Embedding caching: Pre-computed embeddings saved to .npz files. Demo runs with cached embeddings — no OpenAI API calls during the recording. BM25-only mode available when no API key is set.
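A sketch of the vector-search path (names illustrative): embeddings round-trip through raw bytes via tobytes()/frombuffer, and a single matrix-vector product scores every chunk at once.

```python
import numpy as np

def cosine_top_k(query_vec, chunk_blobs, k=5):
    """chunk_blobs: (chunk_id, blob) pairs, each blob a float32 embedding
    serialized with ndarray.tobytes() as it would come out of a SQLite BLOB
    column. Scores all chunks in one matrix-vector product; at this corpus
    size no ANN index is needed."""
    ids = [cid for cid, _ in chunk_blobs]
    mat = np.vstack([np.frombuffer(blob, dtype=np.float32) for _, blob in chunk_blobs])
    q = np.asarray(query_vec, dtype=np.float32)
    sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-12)
    top = np.argsort(-sims)[:k]
    return [(ids[i], float(sims[i])) for i in top]
```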

Three-Level Verification

Every config change goes through the same verification pipeline — like compiling, linting, and testing code.

✓

Syntactic

Like a compiler

Does the config parse correctly?

✓ top_k is a positive integer   ✓ method is keyword|vector|hybrid   ✓ all required fields present
🧠

Semantic

Like a linter

Do the settings make logical sense together?

✗ distraction_detection.enabled is true but method is "keyword". Distraction detection requires "hybrid" to compare keyword vs vector rankings.
🛡

Behavioral

Like a test suite

Does it actually perform better?

Deploy BLOCKED: candidate nUDCG (0.34) < active nUDCG (0.61). The system refuses to make search quality worse.

The Demo Story

An AI agent starts with a naive config and optimizes it in 3 steps — zero human intervention.

Act 1

v1: Naive Baseline

Hybrid search, top-k=10, no filtering, no dynamic-k

nUDCG
0.35
Distractors
4
Act 2

v2: Filter + Detect

Agent excludes FAQ category, enables distraction detection

nUDCG
0.58
Distractors
1
Act 3

v3: Dynamic-k

Agent enables gap-based cutoff — fewer results, higher quality

nUDCG
0.75
Distractors
0

Result: nUDCG improved 114% (0.35 → 0.75). Distractors eliminated (4 → 0). Zero human intervention.

How It Works

Everything runs locally. SQLite for storage, Python for the engine, JSON files for configs.

flowchart TD
    subgraph workspace [Workspace: Files on Disk]
        Collections["collections/*.json"]
        Configs["configs/*.json"]
        ActiveConfig["configs/active.json"]
        Documents["documents/*.md"]
        Evals["evals/golden.json"]
    end

    subgraph engine [Hive Engine]
        Validator["Validator\n(syntactic + semantic + behavioral)"]
        Indexer["Indexer\n(chunk + BM25 + embed)"]
        Searcher["Searcher\n(RRF + dynamic-k + disagreement)"]
        Evaluator["Evaluator\n(UDCG + comparison)"]
    end

    subgraph storage [SQLite: hive.db]
        ChunksTable["chunks"]
        PostingsTable["postings + term_stats"]
        ConfigHistory["config_versions"]
        EvalResults["eval_results"]
    end

    CLI["CLI: validate | index | query | evaluate | compare | deploy"]

    CLI --> Validator
    CLI --> Indexer
    CLI --> Searcher
    CLI --> Evaluator

    Validator --> Configs
    Indexer --> Documents
    Searcher --> ChunksTable
    Evaluator --> Evals
        

Key Concepts

Quick reference for non-technical readers.

UDCG

A search quality score that penalizes wrong answers, not just rewards right ones. Traditional nDCG treats distractors as harmless zeros.

Distractor

A search result that looks right but means something different. Like finding "Password Reset FAQ" when you searched for "API Authentication."

Dynamic-k

Instead of always returning 10 results, stop when quality drops off a cliff. Fewer results = fewer chances for distractors to sneak in.

RRF (Reciprocal Rank Fusion)

A way to combine keyword search and meaning search without worrying about incompatible score scales. Rank-based, not score-based.

Deploy Gate

The system refuses to deploy a config that makes search quality worse. Like a CI pipeline that blocks merging if tests fail.

Feedback Loop

The agent tries, measures, learns, and tries again. Like a developer running tests after each code change — but fully automated.