research(memory): agentic memory benchmarking harness — hit rate, recall latency, compression ratio metrics (arXiv:2602.19320) #2419

@bug-ops

Finding

Anatomy of Agentic Memory (arXiv:2602.19320)

A comprehensive taxonomy and evaluation framework for agent memory systems. It defines standardized metrics: recall hit rate, latency per recall, compression ratio, interference rate (new memories degrading recall of older ones), and context utilization efficiency.

Applicability to Zeph

Zeph has no systematic memory benchmarking. The journal records qualitative results ("cross-session recall works") but no quantitative metrics. A benchmarking harness would make it possible to tune thresholds (cross_session_score_threshold, compaction_threshold, admission.threshold, etc.) from measurements rather than by feel.
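To make "data-driven tuning" concrete, here is a minimal sketch of sweeping one threshold against benchmark output. Everything in it is an assumption for illustration: `evaluate` is a hypothetical callable that would run the benchmark at a given threshold, the objective (hit rate minus interference rate) is just one plausible choice, and `fake_evaluate` returns synthetic numbers, not measurements from Zeph.

```python
def sweep_threshold(evaluate, candidates):
    """Return (best_threshold, best_score) over the candidate values.

    `evaluate` is a hypothetical callable: it would run the benchmark with the
    given threshold and return a dict with "hit_rate" and "interference_rate".
    """
    best = None
    for threshold in candidates:
        metrics = evaluate(threshold)
        # Assumed objective: reward correct recalls, penalize contamination.
        score = metrics["hit_rate"] - metrics["interference_rate"]
        if best is None or score > best[1]:
            best = (threshold, score)
    return best


def fake_evaluate(threshold):
    # Synthetic numbers for illustration only, not measured results.
    kept = 1.0 - threshold
    return {"hit_rate": kept, "interference_rate": kept ** 2}


if __name__ == "__main__":
    print(sweep_threshold(fake_evaluate, [0.1, 0.3, 0.5, 0.7, 0.9]))
```

In a real run, `fake_evaluate` would be replaced by a function that sets the threshold in the config and invokes the harness.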

Proposed design:

# .local/testing/bench-memory.py
# 1. Seed N facts with known content
# 2. Run M recall queries at different time delays
# 3. Report: hit_rate, avg_recall_latency_ms, compression_ratio, interference_rate
python3 .local/testing/bench-memory.py --facts 50 --queries 100 --sessions 5
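The three steps above could be sketched roughly as follows. This is not the Zeph implementation: `InMemoryStore` is a hypothetical stand-in for the real memory backend, sessions and time delays are only approximated by spreading queries across a session counter, and compression_ratio and interference_rate are left out because the stand-in has no compaction. Only the CLI flags (`--facts`, `--queries`, `--sessions`) come from the proposal.

```python
import argparse
import random
import time


class InMemoryStore:
    """Hypothetical stand-in for the agent's memory backend."""

    def __init__(self):
        self._facts = {}

    def save(self, key, value):
        self._facts[key] = value

    def search(self, key):
        return self._facts.get(key)


def run_benchmark(store, n_facts, n_queries, n_sessions, seed=0):
    rng = random.Random(seed)
    # 1. Seed N facts with known content.
    expected = {f"fact-{i}": f"value-{i}" for i in range(n_facts)}
    for key, value in expected.items():
        store.save(key, value)

    keys = list(expected)
    hits = 0
    latencies_ms = []
    queries_per_session = [0] * n_sessions
    # 2. Run M recall queries, assigned round-robin to sessions.
    #    A real harness would restart the agent between sessions.
    for q in range(n_queries):
        queries_per_session[q % n_sessions] += 1
        key = rng.choice(keys)
        start = time.perf_counter()
        recalled = store.search(key)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if recalled == expected[key]:
            hits += 1

    # 3. Report the metrics this sketch can actually measure.
    return {
        "hit_rate": hits / n_queries,
        "avg_recall_latency_ms": sum(latencies_ms) / len(latencies_ms),
        "queries_per_session": queries_per_session,
    }


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="memory benchmark sketch")
    parser.add_argument("--facts", type=int, default=50)
    parser.add_argument("--queries", type=int, default=100)
    parser.add_argument("--sessions", type=int, default=5)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(run_benchmark(InMemoryStore(), args.facts, args.queries, args.sessions))
```

Swapping `InMemoryStore` for a thin wrapper around the real memory_search call would be the main integration step; the reporting code would not need to change.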

Metrics to track:

  • hit_rate: fraction of seeded facts recalled correctly
  • recall_latency_p50/p99: median and 99th-percentile latency of a memory_search call, in milliseconds
  • compression_ratio: token count before compaction divided by token count after compaction, per session
  • interference_rate: fraction of recalled facts that are contaminated by newer, unrelated memories
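As a rough illustration of how these metrics might be computed from raw benchmark records, here is a small sketch. The function names and input shapes are assumptions, the percentile uses the simple nearest-rank method, and compression_ratio is read as "before divided by after" (so values above 1 mean compaction shrank the context). How a recalled fact gets judged "contaminated" is left to the caller.

```python
import math


def hit_rate(recalled_ok):
    """Fraction of seeded facts recalled correctly (iterable of booleans)."""
    recalled_ok = list(recalled_ok)
    return sum(recalled_ok) / len(recalled_ok)


def percentile(samples, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compression_ratio(tokens_before, tokens_after):
    """Tokens before compaction divided by tokens after compaction."""
    return tokens_before / tokens_after


def interference_rate(contaminated_flags):
    """Fraction of recalled facts flagged as contaminated by newer memories."""
    contaminated_flags = list(contaminated_flags)
    return sum(contaminated_flags) / len(contaminated_flags)


if __name__ == "__main__":
    latencies_ms = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # made-up sample values
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
```

With few samples the nearest-rank p99 collapses to the maximum, so the latency percentiles only become informative once the query count is reasonably large.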

Priority

P3 — tooling improvement; enables data-driven tuning of existing thresholds.

Source

  • arXiv:2602.19320 — Anatomy of Agentic Memory: A Principled Survey and Evaluation Framework

Labels

  • P3: Research, medium-high complexity
  • memory: zeph-memory crate (SQLite)
  • research: Research-driven improvement
