Research: semantic response caching for LLM API cost reduction #1521
Closed
Labels
- P1: High ROI, low complexity — do next sprint
- enhancement: New feature or request
- llm: zeph-llm crate (Ollama, Claude)
Description
Context
Zeph's current response_cache_enabled option uses exact-match caching: a response is reused only when the incoming prompt is byte-for-byte identical to a cached one. Research indicates semantic caching (matching by embedding similarity rather than exact query text) can reduce LLM API calls by up to 69%.
Concept
Instead of caching only identical prompts, cache responses for semantically similar queries:
- Embed incoming user query
- Search cache for similar embeddings (threshold ~0.95)
- On cache hit: return cached response (sub-millisecond)
- On cache miss: call LLM, store response + embedding
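The lookup/insert flow above can be sketched as a minimal in-memory cache keyed by cosine similarity. This is an illustration only: the types, the Vec-backed store, and the 0.95 threshold are assumptions for the sketch, not Zeph's actual API (a real implementation would back this with the existing sqlite vector or Qdrant store, and embeddings would come from the Ollama embedding model).

```rust
// Sketch of a semantic response cache. The in-memory Vec store and all
// names here are hypothetical; Zeph would use its existing vector store.

struct CacheEntry {
    embedding: Vec<f32>, // embedding of the original query
    response: String,    // the LLM response to replay on a hit
}

struct SemanticCache {
    entries: Vec<CacheEntry>,
    threshold: f32, // similarity cutoff, e.g. 0.95
}

// Standard cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl SemanticCache {
    // Return the most similar cached response above the threshold, if any.
    fn lookup(&self, query_embedding: &[f32]) -> Option<&str> {
        self.entries
            .iter()
            .map(|e| (cosine_similarity(&e.embedding, query_embedding), e))
            .filter(|(sim, _)| *sim >= self.threshold)
            .max_by(|(a, _), (b, _)| a.partial_cmp(b).unwrap())
            .map(|(_, e)| e.response.as_str())
    }

    // On a cache miss, store the fresh response alongside its embedding.
    fn insert(&mut self, embedding: Vec<f32>, response: String) {
        self.entries.push(CacheEntry { embedding, response });
    }
}
```

With a brute-force scan this lookup is O(n) per query; delegating the similarity search to Qdrant or a sqlite vector index keeps hits in the sub-millisecond range the pattern promises.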
Applicability to Zeph
- Already has embedding infrastructure (sqlite vector, Qdrant)
- Already has response_cache_enabled config option — could extend
- Most impactful for: repeated similar questions across sessions, skill-triggered prompts, experiment evaluator calls
- Less impactful for: unique creative queries, tool-heavy interactions
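One way the existing config option could be extended is sketched below. Only response_cache_enabled exists today; the semantic_cache_enabled and similarity_threshold fields are hypothetical names for this proposal, not part of the zeph-llm crate.

```rust
// Hypothetical extension of Zeph's cache config. Only
// response_cache_enabled is real; the other fields are assumptions.
#[derive(Debug, Clone)]
struct CacheConfig {
    response_cache_enabled: bool, // existing exact-match cache toggle
    semantic_cache_enabled: bool, // proposed: similarity-based fallback
    similarity_threshold: f32,    // proposed: cutoff for a cache hit
}

impl Default for CacheConfig {
    fn default() -> Self {
        Self {
            response_cache_enabled: true,
            // Opt-in, since it requires an embedding model (Ollama)
            // and is unavailable in a Claude-only configuration.
            semantic_cache_enabled: false,
            similarity_threshold: 0.95,
        }
    }
}
```

Keeping the semantic layer behind a separate flag lets exact-match caching keep working unchanged in Claude-only setups.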
Trade-offs
- Requires embedding model (Ollama) — not available in Claude-only config
- Similarity threshold tuning: too low = stale/wrong answers, too high = no cache hits
- Cache invalidation: context changes may make cached responses incorrect
- Memory overhead: storing embeddings + responses
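The invalidation trade-off can be partly mitigated with a TTL on cached entries, so a stale response ages out instead of being replayed indefinitely. A minimal sketch, assuming an Instant-stamped entry (the struct and TTL policy are illustrative, not Zeph internals):

```rust
use std::time::{Duration, Instant};

// Sketch of TTL-based invalidation: each cached response remembers when
// it was stored, and lookups treat expired entries as misses.
struct TimedEntry {
    response: String,
    inserted_at: Instant,
}

impl TimedEntry {
    // An entry is usable only while younger than the configured TTL.
    fn is_fresh(&self, ttl: Duration) -> bool {
        self.inserted_at.elapsed() < ttl
    }
}
```

A TTL does not solve context-sensitivity (a cached answer can be wrong the moment the conversation context changes), so context-dependent prompts would still need to bypass the semantic cache entirely.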
References
- AI Agent Architecture (Redis) — semantic caching pattern, 69% API call reduction, 15X faster responses
- Redis LangCache: vector-similarity based query caching