research(routing): Memory Augmented Routing — use retrieved context to downgrade to smaller model, 96% cost reduction (arXiv:2603.23013) #2443
Description
Paper
Title: Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents
arXiv: https://arxiv.org/abs/2603.23013
Published: 2026-03-24
Key Technique
In production AI agents, up to 47% of queries are semantically similar repeats. Instead of routing every query to a large model, this framework:
- Retrieves prior conversational context from memory for each query
- Routes queries with high-confidence memory hits to a lightweight 8B model
- Routes novel/low-confidence queries to the full-scale model
Results (without additional training): 8B + memory retrieval → 30.5% F1, recovering 69% of 235B model performance at 96% cost reduction.
Key insight: memory makes routing worthwhile; routing makes memory cost-effective.
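The routing rule above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the type names, the similarity field, and the 0.9 threshold are all assumptions for clarity.

```rust
// Hypothetical sketch of memory-augmented routing. All names and the
// threshold value are illustrative assumptions, not from the paper.

struct MemoryHit {
    answer: String,
    similarity: f32, // similarity of the retrieved context to the query
}

#[derive(Debug, PartialEq)]
enum Route {
    Small, // lightweight 8B model
    Large, // full-scale model
}

/// Route a query: a high-confidence memory hit goes to the small model;
/// novel or low-confidence queries go to the large model.
fn route(hit: Option<&MemoryHit>, threshold: f32) -> Route {
    match hit {
        Some(h) if h.similarity >= threshold => Route::Small,
        _ => Route::Large,
    }
}

fn main() {
    let hit = MemoryHit { answer: "cached context".into(), similarity: 0.93 };
    assert_eq!(route(Some(&hit), 0.9), Route::Small); // confident recall
    assert_eq!(route(None, 0.9), Route::Large);       // novel query
    println!("routing sketch ok");
}
```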
Why Relevant to Zeph
Zeph already has SemanticMemory with MMR retrieval and PILOT LinUCB bandit routing. The gap: bandit routing bases its decisions on task complexity (SLM audit) without considering memory-hit confidence. This paper shows that memory retrieval quality is a strong routing signal: high-confidence recall means a cheap model is sufficient.
Integration sketch: add memory_confidence as a routing feature to LinUCB bandits alongside the existing complexity score. When memory_search returns similarity >= threshold (e.g. 0.9), route to fast provider. Extends RoutingContext with memory_hit_confidence: Option<f32>.
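A minimal sketch of that extension, assuming field names and the feature-vector shape (Zeph's actual `RoutingContext` and LinUCB interface may differ):

```rust
// Hypothetical extension of Zeph's RoutingContext; field names, the
// feature layout, and the threshold are assumptions for illustration.
struct RoutingContext {
    complexity_score: f32,              // existing SLM-audit complexity feature
    memory_hit_confidence: Option<f32>, // new: top memory_search similarity, if any
}

// Example threshold from the integration sketch, not a tuned value.
const MEMORY_CONFIDENCE_THRESHOLD: f32 = 0.9;

/// Build the LinUCB feature vector: complexity plus the memory-derived
/// signal (0.0 when memory_search returned no hit).
fn features(ctx: &RoutingContext) -> [f32; 2] {
    [ctx.complexity_score, ctx.memory_hit_confidence.unwrap_or(0.0)]
}

/// Short-circuit: a high-confidence memory hit prefers the fast provider
/// before the bandit is even consulted.
fn prefers_fast_provider(ctx: &RoutingContext) -> bool {
    ctx.memory_hit_confidence
        .map_or(false, |c| c >= MEMORY_CONFIDENCE_THRESHOLD)
}

fn main() {
    let ctx = RoutingContext {
        complexity_score: 0.4,
        memory_hit_confidence: Some(0.93),
    };
    assert!(prefers_fast_provider(&ctx));
    assert_eq!(features(&ctx), [0.4, 0.93]);
    println!("extension sketch ok");
}
```

Feeding the signal into the bandit as a feature (rather than only short-circuiting) lets LinUCB learn how much weight the memory signal deserves per arm.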
Complements #2415 (BaRP cost-weight dial) and extends existing PILOT bandit with a new memory-derived signal.
Priority Rationale
P2: directly extends Zeph's existing bandit routing infrastructure with a well-validated signal. 96% cost reduction is a compelling production metric. Zeph already has all required subsystems — this is a composition improvement.