feat(memory): semantic response caching with embedding similarity#2029

Merged
bug-ops merged 5 commits into main from issue-1521-semantic-caching
Mar 20, 2026

Conversation

Owner

@bug-ops bug-ops commented Mar 20, 2026

Summary

Implement semantic cache alongside exact-match caching to reduce LLM API calls by matching user queries based on embedding similarity rather than exact text match.

Changes

  • ResponseCache extended with get_semantic(), put_with_embedding(), invalidate_embeddings_for_model(), cleanup()
  • CacheCheckResult enum ensures single embedding generation per request
  • Config: semantic_cache_enabled, semantic_cache_threshold (0.95), semantic_cache_max_candidates (10)
  • Agent loop integration: exact-match → semantic fallback → LLM call
  • Tool-call guard: skips semantic cache for context-sensitive responses
  • Migration 037: adds embedding BLOB, embedding_model, embedding_ts columns
  • 13 new unit tests, all 5953 tests pass
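The single-embed flow the CacheCheckResult bullet describes can be sketched as follows. Only the `CacheCheckResult` name comes from this PR; the variant names and the `check_cache` helper are illustrative assumptions, not the actual implementation.

```rust
// Sketch of the single-embed cache check. The embedding is computed once,
// before this function runs, and on a miss it travels with the result so
// put_with_embedding() can reuse it instead of embedding the query again.
#[derive(Debug)]
pub enum CacheCheckResult {
    ExactHit(String),             // found by exact text match (sub-ms path)
    SemanticHit(String),          // found by embedding-similarity fallback
    Miss { embedding: Vec<f32> }, // miss: keep the embedding for the later put
}

pub fn check_cache(
    exact: Option<String>,
    semantic: Option<String>,
    embedding: Vec<f32>,
) -> CacheCheckResult {
    // Exact match is tried first; semantic lookup is only the fallback.
    if let Some(resp) = exact {
        return CacheCheckResult::ExactHit(resp);
    }
    if let Some(resp) = semantic {
        return CacheCheckResult::SemanticHit(resp);
    }
    CacheCheckResult::Miss { embedding }
}
```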

Test Plan

  • All unit tests pass (5953/5953)
  • cargo fmt --check clean
  • cargo clippy --workspace --features full -- -D warnings: 0 warnings
  • Exact-match cache still works (backward compat)
  • Semantic cache skipped for tool-call responses
  • Embedding computed once per request (CacheCheckResult)
  • SQL filters by embedding_model (cross-model prevention)
  • Security audit: LOW risk
  • Performance: single-embed saves 50–200ms per miss
  • Integration test with live Ollama (manual, requires Ollama running)

Follow-up Issues (will file)

  • PERF-SC-01 (MEDIUM): expires_at not in index
  • PERF-SC-02 (LOW): max_candidates undocumented recall limit
  • PERF-SC-03 (LOW): cleanup() not atomic
  • TC-01 (LOW): corrupted BLOB test
  • TC-02 (LOW): dimension mismatch test
  • MINOR-02 (LOW): integration test stubs for Ollama
  • SEC-01 (LOW): threshold NaN/Inf/range validation

Closes #1521

…ching

Implement semantic cache alongside exact-match caching to reduce LLM API calls by
matching user queries based on embedding similarity (~0.95 threshold) rather than
exact text match. Rephrased queries are matched to cached responses from
semantically similar earlier questions.

**Key changes:**
- ResponseCache extended with `get_semantic()`, `put_with_embedding()`,
  `invalidate_embeddings_for_model()`, `cleanup()` methods
- CacheCheckResult enum ensures embedding computed once per request (CRIT-01)
- SQL WHERE clause filters by embedding_model to prevent cross-model false positives (CRIT-02)
- Config: semantic_cache_enabled (default false), semantic_cache_threshold (0.95),
  semantic_cache_max_candidates (10)
- Agent loop: exact-match tried first (sub-ms), semantic as fallback (~150ms)
- Tool-call guard: semantic cache skipped for context-sensitive tool responses
- Migration 037: adds embedding BLOB, embedding_model, embedding_ts columns
- bytemuck zero-copy serialization for Vec<f32> embeddings
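As a rough sketch of the BLOB encoding: the PR uses bytemuck's zero-copy casts, while the explicit little-endian helpers below are hypothetical equivalents (the function names are not from the codebase).

```rust
// Encode a Vec<f32> embedding as a little-endian byte BLOB.
fn embedding_to_blob(v: &[f32]) -> Vec<u8> {
    v.iter().flat_map(|f| f.to_le_bytes()).collect()
}

// Decode a BLOB back into an embedding. A corrupt BLOB whose length is not
// a multiple of 4 is rejected rather than panicking, matching the
// "graceful skip" behavior the follow-up tests verify.
fn blob_to_embedding(blob: &[u8]) -> Option<Vec<f32>> {
    if blob.len() % 4 != 0 {
        return None;
    }
    Some(
        blob.chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}
```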

**Test coverage:** 13 new unit tests, all 5953 existing tests pass.
**Security:** LOW risk, parameterized SQL, local threat model, bytemuck safe.
**Performance:** Single-embed optimization saves 50–200ms per cache miss.
**Backward compat:** Old cache entries (NULL embeddings) still work via exact-match.

Closes #1521
@github-actions github-actions bot added labels: enhancement (New feature or request), size/XL (Extra large PR, 500+ lines) — Mar 20, 2026
@bug-ops bug-ops enabled auto-merge (squash) March 20, 2026 12:44
@bug-ops bug-ops merged commit 9e332a7 into main Mar 20, 2026
25 checks passed
@bug-ops bug-ops deleted the issue-1521-semantic-caching branch March 20, 2026 13:07
bug-ops added a commit that referenced this pull request Mar 20, 2026
Add 6 tests to response_cache.rs to verify cosine_similarity() and get_semantic()
gracefully handle embedding dimension mismatches:

- test_semantic_get_dimension_mismatch_returns_none: store dim=3, query dim=2
- test_semantic_get_dimension_mismatch_query_longer: store dim=2, query dim=3
- test_semantic_get_mixed_dimensions_picks_correct_match: mixed dims, verify correct match
- test_semantic_get_empty_embedding_skipped: empty embedding (BLOB x'') handling
- test_semantic_get_corrupt_blob_skipped: corrupt BLOB graceful skip
- test_semantic_get_all_corrupt_returns_none: all candidates corrupt/empty

Fixes #2034. Addresses PR #2029 review feedback. All tests use threshold=0.01
to correctly verify that cosine_similarity(mismatch)=0.0 does not produce false
hits (0.0 >= 0.0 would be true, so threshold must be > 0).

Tested: 5524 tests pass, no regressions. Verified by tester, perf, security,
and impl-critic agents.
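The mismatch behavior these tests rely on can be sketched as below. The real cosine_similarity() in response_cache.rs may differ in detail, and is_semantic_hit() is a hypothetical helper added for illustration.

```rust
// Dimension mismatches and empty embeddings yield similarity 0.0, which a
// strictly positive threshold (e.g. the 0.01 used in the tests) rejects.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0; // mismatch / empty: never a match
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn is_semantic_hit(sim: f32, threshold: f32) -> bool {
    sim >= threshold
}
```

With a threshold of 0.0 the mismatch case would pass (`0.0 >= 0.0`), which is exactly the false positive the 0.01 test threshold rules out.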
bug-ops added a commit that referenced this pull request Mar 20, 2026 (#2046)

Labels

core (zeph-core crate), documentation (Improvements or additions to documentation), enhancement (New feature or request), memory (zeph-memory crate, SQLite), rust (Rust code changes), size/XL (Extra large PR, 500+ lines)


Development

Successfully merging this pull request may close these issues.

Research: semantic response caching for LLM API cost reduction

1 participant