When an LLM retrieves context, a result that looks right but is wrong doesn't just waste a slot; it actively corrupts reasoning. Traditional metrics can't see this. Hive can.
The red card shares keywords with the query ("authentication") but answers a completely different question (password resets, not API tokens). When the LLM reads it, it contaminates the answer. This is Mutually Assured Distraction (MAD).
The agent writes configs, measures results, and iterates, like a developer running tests after each code change.
Agent writes a JSON config controlling search behavior
Hive runs the search and returns ranked results with scores
Hive scores results against known-good answers using UDCG
Agent reads scores and identifies what went wrong
Agent writes a new config and the loop repeats
Hive is built on a small set of composable primitives. Each one solves a specific problem in retrieval quality, and together they form a system where an AI agent can self-improve without human intervention. These aren't abstract ideas; each primitive maps to concrete code and measurable behavior.
The agent's interface is a directory of JSON files: no custom protocols, no APIs to learn.
This is the key Hornet insight: align retrieval configuration with how coding agents already work. An agent that can edit tsconfig.json can edit a Hive config. No new abstractions. The entire workspace is files on disk: versioned, diffable, reviewable.
Why this matters: LLM agents are fundamentally code-writing tools. They understand files, directories, and JSON. By making retrieval config a file, we turn search optimization into a coding task, which is exactly what these agents are good at.
A config file is the "program" that tells Hive how to search. It's the thing the agent iterates on.
{
  "name": "v1-naive",
  "collection": "knowledge-base",
  "retrieval": {
    "method": "hybrid",            // keyword | vector | hybrid
    "top_k": 10,
    "rrf_k": 60                    // RRF constant (see Primitive 6)
  },
  "dynamic_k": {
    "enabled": false,
    "gap_threshold_factor": 3.0,   // cliff detection sensitivity
    "min_results": 1,
    "max_results": 10
  },
  "filters": {},                   // category pre-filtering
  "distraction_detection": {
    "enabled": false,
    "disagreement_threshold": 0.5
  }
}
Key design choices in this config:
The agent discovers what data looks like by reading the schema, then uses that knowledge to build better configs.
Fields marked filterable: true can be used in config filters. Filters are applied as a WHERE clause before scoring (pre-filter), not after. This means distractors in excluded categories never even enter the ranking pipeline.
{
  "name": "knowledge-base",
  "fields": {
    "title": { "type": "text" },
    "category": { "type": "keyword", "filterable": true },
    "content": { "type": "text" }
  },
  "chunking": {
    "strategy": "by_heading",
    "max_tokens": 512,
    "heading_level": 2
  }
}
The agent can read this file to discover: "category is filterable, and the available values are api-docs, tutorials, faqs, changelogs." Then it can write a config with "filters": {"category": ["api-docs", "tutorials"]} to exclude FAQ distractors entirely.
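As a sketch of how a config filter becomes a pre-filter, the `filters` object can be translated into a SQL WHERE clause that runs before any scoring. The function and column names here are illustrative assumptions, not Hive's actual code:

```python
def build_prefilter(filters):
    """Translate a config's filters object into a parameterized SQL
    WHERE clause. Field names come from the trusted collection schema,
    so interpolating them directly is acceptable in this sketch."""
    clauses, params = [], []
    for field, values in filters.items():
        placeholders = ",".join("?" for _ in values)
        clauses.append(f"{field} IN ({placeholders})")
        params.extend(values)
    # With no filters, match everything.
    where = " AND ".join(clauses) if clauses else "1=1"
    return where, params
```

Because the clause is applied in the initial SELECT over chunks, excluded categories never reach BM25 or vector scoring at all.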
Every config change goes through syntactic, semantic, and behavioral verification, like compiling, linting, and running tests.
This is the Verifiable APIs principle from Hornet: the system's interface should be checkable at multiple levels before anything gets deployed. The agent gets immediate, actionable feedback: not vague errors, but specific "here's what's wrong and how to fix it" messages.
"Does the config parse correctly?"
"Do the settings make logical sense together?"
"Does it actually perform better than what's deployed?"
Run `hive compare` to see what changed.
The behavioral gate is the critical innovation: even if a config passes syntactic and semantic checks, it must prove it's better by running against the golden evaluation set. This prevents an agent from "improving" a config that actually makes retrieval worse.
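The behavioral gate's core check is small. A minimal sketch, assuming nUDCG on the golden set is the comparison metric (the function and parameter names are hypothetical):

```python
def behavioral_gate(candidate_nudcg, deployed_nudcg, min_delta=0.0):
    """Allow deployment only if the candidate config scores at least as
    well as the currently deployed config on the golden evaluation set.
    min_delta can require a strict improvement margin."""
    return candidate_nudcg >= deployed_nudcg + min_delta
```

The point is that the gate compares measured behavior, not config contents: a syntactically perfect config that degrades retrieval is still rejected.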
Instead of always returning top_k results, stop when there's a quality cliff: "retrieve less, retrieve better."
Traditional retrieval always returns a fixed number of results (e.g., 10). But what if only 5 are good and the rest are noise? Dynamic-k detects the "quality cliff" where results transition from strong to weak, and stops there.
Why not score-ratio cutoff? RRF scores are compressed into a narrow band (all values between ~0.018 and ~0.033 for k=60). A result at BM25 rank 20 / vector rank 30 still has a ratio of ~0.72 to the best result. Score-ratio cutoffs would almost never trigger with RRF. Instead, we measure relative gaps.
The algorithm: After ranking by RRF, compute the gap between each consecutive pair. Track a running mean of gaps seen so far. If gap_i > gap_threshold_factor × mean_gap_so_far, stop: a cliff has been detected.
| Rank | RRF Score | Gap | Running Mean | Gap / Mean |
|---|---|---|---|---|
| 1 | 0.0328 | - | - | - |
| 2 | 0.0318 | 0.0010 | 0.0010 | 1.0x ✓ |
| 3 | 0.0307 | 0.0011 | 0.00105 | 1.0x ✓ |
| 4 | 0.0295 | 0.0012 | 0.0011 | 1.1x ✓ |
| 5 | 0.0281 | 0.0014 | 0.00118 | 1.2x ✓ |
| 6 | 0.0266 | 0.0015 | 0.00124 | 1.2x ✓ |
| 7 | 0.0198 | 0.0068 | 0.00124 | 5.5x STOP |
Returns 6 results instead of 10. Ranks 1–6 have gradually increasing but consistent gaps (~0.001), then rank 7 drops sharply: the cliff. This is interpretable: "stop when the next result is dramatically worse than the trend so far." The approach works with compressed RRF scores because it measures relative gaps, not absolute score levels.
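The gap-based cutoff described above can be sketched as follows. This is a minimal illustration; Hive's actual implementation may differ in details such as how the running mean is seeded:

```python
def dynamic_k(scores, gap_threshold_factor=3.0, min_results=1, max_results=10):
    """Return how many ranked results to keep, stopping at a quality cliff.

    scores: RRF scores in descending rank order.
    """
    scores = scores[:max_results]
    if len(scores) <= min_results:
        return len(scores)
    gaps = []
    for i in range(1, len(scores)):
        gap = scores[i - 1] - scores[i]
        if gaps:
            mean_gap = sum(gaps) / len(gaps)
            # Cliff: this gap is much larger than the trend so far.
            if gap > gap_threshold_factor * mean_gap and i >= min_results:
                return i
        gaps.append(gap)
    return len(scores)
```

Running this on the scores from the table returns 6: the 0.0068 gap at rank 7 is 5.5x the running mean, well past the 3.0 threshold.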
Combine keyword search and semantic search using ranks, not raw scores; no normalization needed.
The naive approach to hybrid search is to normalize BM25 scores (range 0–15) and cosine similarity scores (range 0–1) to the same scale and take a weighted average. This is fragile: the normalization depends on the query, the corpus size, and the score distributions. RRF sidesteps all of this by operating on ranks instead of scores.
The fused score of a document d is RRF(d) = Σ_r 1/(k + rank_r(d)), summed over each rank list r being fused, where k is a constant (default 60, configurable as rrf_k). Three advantages:
For keyword-only mode, only BM25 ranks are used. For vector-only, only cosine ranks. For hybrid, both contribute to the final score.
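Rank-based fusion is compact to implement. A sketch, assuming each input is a list of doc ids in best-first rank order:

```python
def rrf_fuse(bm25_ranking, vector_ranking, k=60):
    """Fuse two rank lists into one ranking via Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) per document; documents absent
    from a list simply get no contribution from it.
    """
    scores = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Note that a document ranked first by both lists scores 2/61 ≈ 0.0328 with k=60, matching the top score in the dynamic-k table above.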
A result that ranks high in keywords but low in meaning (or vice versa) is suspicious; flag it.
Distractors share keywords with the query but mean something different. BM25 ranks them high (keyword overlap), but vector search ranks them low (different semantics). This disagreement between scoring methods is a detectable signal at query time, with no golden labels needed.
Doc B matches on keywords but means something different: a classic FAQ distractor. The flag appears in query output so the agent can see it, but it does NOT auto-filter the result. The agent decides what to do. At evaluation time (not query time), golden labels provide exact distractor counts as ground truth.
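One way to compute the disagreement signal is as a normalized rank distance between the two rankers. This sketch illustrates the idea; the exact formula behind `disagreement_threshold` in the config is an assumption here:

```python
def flag_disagreement(bm25_rank, vector_rank, n_results,
                      disagreement_threshold=0.5):
    """Flag a result whose keyword and semantic ranks disagree strongly.

    Ranks are 1-based. Disagreement is the rank distance normalized to
    [0, 1] by the worst possible distance within the result set.
    """
    disagreement = abs(bm25_rank - vector_rank) / max(n_results - 1, 1)
    return disagreement > disagreement_threshold
```

A doc at BM25 rank 1 but vector rank 9 (out of 10) gets disagreement ≈ 0.89 and is flagged; ranks 2 and 3 agree closely and pass.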
The metric that makes distractors visible: traditional nDCG treats wrong answers as harmless zeros, while UDCG assigns them -1.
Each result's contribution is its utility (+1 relevant, -1 distractor, 0 irrelevant) weighted by rank position with the standard DCG discount 1/log2(rank + 1), so a distractor at rank 1 costs more than one at rank 5. The result is normalized by the ideal score (all relevant docs ranked first) to get nUDCG in the range [-1, +1].
Worked example: a query with 3 relevant (R), 1 distractor (D), 1 irrelevant (I):
| Rank | Doc | Utility | Discount | Contribution |
|---|---|---|---|---|
| 1 | R | +1.0 | 1.000 | +1.000 |
| 2 | D | -1.0 | 0.631 | -0.631 |
| 3 | R | +1.0 | 0.500 | +0.500 |
| 4 | I | 0.0 | 0.431 | 0.000 |
| 5 | R | +1.0 | 0.387 | +0.387 |
Key insight: The distractor at rank 2 cost 0.631 points. If we removed it and shifted results up, nUDCG would jump to ~0.85. This is what the agent discovers through the feedback loop: removing distractors is more valuable than finding more relevant documents.
Document-level dedup: Golden labels are per-document but retrieval returns per-chunk results. A document may produce multiple chunks. To avoid double-counting: for each document, only the highest-ranked chunk contributes to UDCG. All subsequent chunks from the same document are skipped (utility = 0). This prevents distractor penalties from being inflated by multi-chunk documents.
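Putting the utility weighting, the logarithmic discount, normalization, and document-level dedup together, a sketch (golden labels reduced to two sets of doc ids; function and parameter names are illustrative):

```python
import math

def nudcg(ranked, relevant, distractors):
    """Normalized Utility-Discounted Cumulative Gain.

    ranked: list of (doc_id, chunk_id) results, best first.
    relevant / distractors: sets of doc ids from golden labels.
    Utilities: +1 relevant, -1 distractor, 0 otherwise. Only the
    highest-ranked chunk per document counts (document-level dedup).
    """
    seen_docs = set()
    score = 0.0
    for rank, (doc_id, _chunk_id) in enumerate(ranked, start=1):
        if doc_id in seen_docs:
            continue  # later chunks of the same doc contribute nothing
        seen_docs.add(doc_id)
        utility = 1.0 if doc_id in relevant else (
            -1.0 if doc_id in distractors else 0.0)
        score += utility / math.log2(rank + 1)
    # Ideal score: all relevant docs ranked first.
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, len(relevant) + 1))
    return score / ideal if ideal else 0.0
```

On the worked example above (R, D, R, I, R), the discounted contributions sum to 1.256 against an ideal of 2.131, giving nUDCG ≈ 0.59.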
The foundational insight: a wrong result doesn't just waste a slot; it actively corrupts LLM reasoning.
Traditional search evaluation treats results as either relevant (1) or irrelevant (0). This made sense when humans scanned a list: an irrelevant result was just skipped. But when an LLM reads search results as context, a distractor (a result that looks right but is wrong) actively corrupts the answer. The LLM can't distinguish between "API authentication via bearer tokens" and "authentication FAQ about password resets"; both mention authentication, but one answers the wrong question.
This is the MAD insight from Hornet: in LLM-powered retrieval, irrelevant is not the worst outcome; misleading is. An FAQ document about password resets doesn't just fail to help with an API auth question; it pulls the LLM's response toward password reset instructions, contaminating the final answer.
Every primitive in Hive exists to address this: UDCG makes the cost visible (-1 per distractor), score disagreement detects them at query time, dynamic-k prevents over-retrieving them, category filters exclude them structurally, and deploy gates prevent configs that increase them. The entire system is oriented around one question: "are we injecting harmful context into the LLM?"
No Elasticsearch. No Pinecone. No Docker. BM25 from scratch, vector search with numpy, all in SQLite.
Hive implements the entire search engine from first principles: BM25 scoring with an inverted index (~100 lines), vector search via cosine similarity on numpy arrays, and RRF fusion, all stored in a single SQLite database. For the prototype's corpus of ~60 chunks, this runs in milliseconds with no external services.
BM25 uses an inverted index in postings, per-term doc frequencies in term_stats, and corpus stats for length normalization. O(matching chunks) per query term.
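For reference, a self-contained BM25 scorer over tokenized documents looks roughly like this. It is a sketch: the real engine reads these statistics from the SQLite tables rather than recomputing them, and the parameter defaults k1=1.5, b=0.75 are common choices, not confirmed values:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per term (the term_stats table in Hive).
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequencies within this document
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term absent from the corpus
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A document sharing both query terms outranks one sharing only a common term, and a document sharing none scores exactly zero.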
Vectors are stored as numpy arrays serialized with tobytes() in SQLite BLOBs. Cosine similarity via np.dot. For 50–150 chunks, this is instant (<5ms). No ANN index needed.
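The BLOB round-trip and the similarity computation are a few lines of numpy. A sketch; the helper names are illustrative:

```python
import numpy as np

def to_blob(vec):
    """Serialize a float32 vector for a SQLite BLOB column."""
    return np.asarray(vec, dtype=np.float32).tobytes()

def from_blob(blob):
    """Restore a float32 vector from a SQLite BLOB."""
    return np.frombuffer(blob, dtype=np.float32)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

At corpus sizes of a few hundred chunks, brute-force cosine over all stored vectors is cheaper than maintaining any approximate index.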
Embeddings are cached in .npz files. The demo runs with cached embeddings, so no OpenAI API calls are made during the recording. A BM25-only mode is available when no API key is set.
Every config change goes through the same verification pipeline, like compiling, linting, and testing code.
Does the config parse correctly?
Do the settings make logical sense together?
Does it actually perform better?
An AI agent starts with a naive config and optimizes it in 3 steps, with zero human intervention.
Hybrid search, top-k=10, no filtering, no dynamic-k
Agent excludes FAQ category, enables distraction detection
Agent enables gap-based cutoff: fewer results, higher quality
Result: nUDCG improved 114% (0.35 → 0.75). Distractors eliminated (4 → 0). Zero human intervention.
Everything runs locally. SQLite for storage, Python for the engine, JSON files for configs.
flowchart TD
subgraph workspace [Workspace: Files on Disk]
Collections["collections/*.json"]
Configs["configs/*.json"]
ActiveConfig["configs/active.json"]
Documents["documents/*.md"]
Evals["evals/golden.json"]
end
subgraph engine [Hive Engine]
Validator["Validator\n(syntactic + semantic + behavioral)"]
Indexer["Indexer\n(chunk + BM25 + embed)"]
Searcher["Searcher\n(RRF + dynamic-k + disagreement)"]
Evaluator["Evaluator\n(UDCG + comparison)"]
end
subgraph storage [SQLite: hive.db]
ChunksTable["chunks"]
PostingsTable["postings + term_stats"]
ConfigHistory["config_versions"]
EvalResults["eval_results"]
end
CLI["CLI: validate | index | query | evaluate | compare | deploy"]
CLI --> Validator
CLI --> Indexer
CLI --> Searcher
CLI --> Evaluator
Validator --> Configs
Indexer --> Documents
Searcher --> ChunksTable
Evaluator --> Evals
Quick reference for non-technical readers.
A search quality score that penalizes wrong answers, not just rewards right ones. Traditional nDCG treats distractors as harmless zeros.
A search result that looks right but means something different. Like finding "Password Reset FAQ" when you searched for "API Authentication."
Instead of always returning 10 results, stop when quality drops off a cliff. Fewer results = fewer chances for distractors to sneak in.
A way to combine keyword search and meaning search without worrying about incompatible score scales. Rank-based, not score-based.
The system refuses to deploy a config that makes search quality worse. Like a CI pipeline that blocks merging if tests fail.
The agent tries, measures, learns, and tries again. Like a developer running tests after each code change, but fully automated.