
Research: semantic response caching for LLM API cost reduction #1521

@bug-ops

Description


Context

The current response_cache_enabled option in Zeph uses exact-match caching. Research indicates that semantic caching (matching queries by embedding similarity rather than by exact text) can reduce LLM API calls by up to 69%.

Concept

Instead of caching only identical prompts, cache responses for semantically similar queries:

  1. Embed incoming user query
  2. Search cache for similar embeddings (threshold ~0.95)
  3. On cache hit: return cached response (sub-millisecond)
  4. On cache miss: call LLM, store response + embedding
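The cache side of this flow can be sketched in Rust. This is a toy in-memory version: the name SemanticCache and its API are hypothetical, the embedding step is assumed to happen elsewhere (e.g. via Ollama), and a real implementation would persist vectors in the existing sqlite/Qdrant infrastructure rather than a Vec.

```rust
/// Cosine similarity between two equal-length embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Toy in-memory semantic cache: (embedding, cached response) pairs.
struct SemanticCache {
    threshold: f32,
    entries: Vec<(Vec<f32>, String)>,
}

impl SemanticCache {
    fn new(threshold: f32) -> Self {
        Self { threshold, entries: Vec::new() }
    }

    /// Steps 2-3: return the most similar cached response, if any
    /// stored entry clears the similarity threshold.
    fn get(&self, query_embedding: &[f32]) -> Option<&str> {
        let mut best: Option<(f32, &str)> = None;
        for (emb, resp) in &self.entries {
            let sim = cosine(emb, query_embedding);
            if sim >= self.threshold && best.map_or(true, |(b, _)| sim > b) {
                best = Some((sim, resp.as_str()));
            }
        }
        best.map(|(_, resp)| resp)
    }

    /// Step 4: on a miss, store the LLM response together with the
    /// query's embedding for future lookups.
    fn put(&mut self, embedding: Vec<f32>, response: String) {
        self.entries.push((embedding, response));
    }
}
```

With threshold 0.95, a response stored under embedding [1.0, 0.0] is returned for a near-duplicate query embedding such as [0.99, 0.05] (cosine ≈ 0.999), while an orthogonal embedding misses and would fall through to the LLM call.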

Applicability to Zeph

  • Already has embedding infrastructure (SQLite vector storage, Qdrant)
  • Already has response_cache_enabled config option — could extend
  • Most impactful for: repeated similar questions across sessions, skill-triggered prompts, experiment evaluator calls
  • Less impactful for: unique creative queries, tool-heavy interactions

Trade-offs

  • Requires embedding model (Ollama) — not available in Claude-only config
  • Similarity threshold tuning: too low = stale/wrong answers, too high = no cache hits
  • Cache invalidation: context changes may make cached responses incorrect
  • Memory overhead: storing embeddings + responses
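The threshold trade-off can be made concrete with a toy pair of "paraphrase" embeddings (the vectors below are made up for illustration; real values would come from the embedding model):

```rust
/// Cosine similarity between two equal-length embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b))
}

/// A query is a cache hit when its similarity to a stored embedding
/// clears the configured threshold.
fn cache_hit(query: &[f32], cached: &[f32], threshold: f32) -> bool {
    cosine(query, cached) >= threshold
}
```

For example, embeddings (0.80, 0.60) and (0.86, 0.51) have cosine ≈ 0.994: a hit at the suggested ~0.95 threshold, but a miss at an overly strict 0.999, which would forfeit the cache savings; conversely, a much looser threshold would also admit genuinely different queries and serve wrong answers.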

References

  • AI Agent Architecture (Redis) — semantic caching pattern, 69% API call reduction, 15X faster responses
  • Redis LangCache: vector-similarity based query caching

Metadata

Labels

  • P1: High ROI, low complexity; do next sprint
  • enhancement: New feature or request
  • llm: zeph-llm crate (Ollama, Claude)
