bug(memory): tier promotion intermittent SQLite lock contention causes merge_failures #2511
Closed
Labels
P3: Research, medium-high complexity
bug: Something isn't working
memory: zeph-memory crate (SQLite)
Description
Summary
During live testing (CI-350, 2026-03-31), the tier promotion background task intermittently fails with a "database is locked" error (SQLite result code 5, SQLITE_BUSY):
WARN zeph_memory::tiers: tier promotion: cluster merge failed, skipping cluster_size=2 error=database error: error returned from database: (code: 5) database is locked
INFO zeph_memory::tiers: tier promotion: sweep complete candidates=7 clusters=1 promoted=0 merge_failures=1 elapsed_ms=2379
INFO zeph_memory::tiers: tier promotion: sweep complete candidates=7 clusters=1 promoted=1 merge_failures=0 elapsed_ms=1894
The merge failure is transient — the next sweep succeeds. The failure occurs when the tier promotion background writer contends with the main agent loop writer on the SQLite file DB (pool_size=5, busy_timeout=5s).
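Since the failure is transient, it helps to distinguish it from permanent database errors. Code 5 is SQLite's primary result code SQLITE_BUSY, and extended variants such as SQLITE_BUSY_SNAPSHOT (517) keep that value in the low 8 bits. A minimal sketch of such a check follows; the function name is hypothetical and not part of zeph-memory, and extracting the numeric code from the database error type is left out.

```rust
/// Primary SQLite result code for "database is locked" (value from the SQLite C API).
const SQLITE_BUSY: i32 = 5;

/// Returns true when `code` (primary or extended) is a SQLITE_BUSY variant.
///
/// Extended result codes store the primary code in the low 8 bits,
/// e.g. SQLITE_BUSY_SNAPSHOT = 5 | (2 << 8) = 517.
/// Hypothetical helper: the name and placement are assumptions.
pub fn is_transient_lock_error(code: i32) -> bool {
    (code & 0xFF) == SQLITE_BUSY
}
```

A caller could use this to decide whether a failed merge is worth retrying or should be reported as a real error.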
Root Cause
merge_cluster_and_promote() runs inside a background tokio task while the main agent loop holds a write transaction. WAL checkpointing or exclusive write locks by the agent loop can cause the background tier task to hit the OS-level lock before SQLite's busy_timeout kicks in.
Impact
- Tier promotion is skipped for that sweep but retried next interval (every ~30s)
- Memory consolidation is delayed, not lost
- The merge_failures counter increments, but no metric or alert is exposed to the user
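One way the last point could be addressed is to keep the sweep counters in a shared structure that a metrics endpoint or alert check can read. The sketch below is an assumption about how that might look; the type, its fields, and the threshold are hypothetical and do not reflect the actual zeph-memory code.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counters for tier promotion sweeps (names are illustrative).
#[derive(Default)]
pub struct TierPromotionMetrics {
    sweeps: AtomicU64,
    promoted: AtomicU64,
    merge_failures: AtomicU64,
}

impl TierPromotionMetrics {
    /// Record the outcome of one sweep, mirroring the fields in the log line.
    pub fn record_sweep(&self, promoted: u64, merge_failures: u64) {
        self.sweeps.fetch_add(1, Ordering::Relaxed);
        self.promoted.fetch_add(promoted, Ordering::Relaxed);
        self.merge_failures.fetch_add(merge_failures, Ordering::Relaxed);
    }

    /// Total clusters promoted so far.
    pub fn promoted(&self) -> u64 {
        self.promoted.load(Ordering::Relaxed)
    }

    /// Total merge failures so far.
    pub fn merge_failures(&self) -> u64 {
        self.merge_failures.load(Ordering::Relaxed)
    }

    /// Average merge failures per sweep; 0.0 before the first sweep.
    pub fn failures_per_sweep(&self) -> f64 {
        let sweeps = self.sweeps.load(Ordering::Relaxed);
        if sweeps == 0 {
            return 0.0;
        }
        self.merge_failures() as f64 / sweeps as f64
    }

    /// True when the failure rate exceeds an (assumed) alert threshold.
    pub fn should_alert(&self, threshold: f64) -> bool {
        self.failures_per_sweep() > threshold
    }
}
```

With the two sweeps from the log above (one failure, then one success), the failure rate would be 0.5 per sweep.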
Steps to Reproduce
- Use testing.toml with [memory] graph enabled
- Seed multiple memory entries across sessions
- Run a session that triggers both main-loop writes (graph entity extraction) and background tier promotion
- Observe WARN zeph_memory::tiers: tier promotion: cluster merge failed
Expected Behavior
Tier promotion should not fail with lock contention errors. Either:
- Increase busy_timeout for the tier promotion transaction
- Use BEGIN DEFERRED instead of BEGIN IMMEDIATE for tier reads
- Or add exponential backoff retry inside merge_cluster_and_promote
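The third option could be sketched roughly as follows. This is a generic illustration under stated assumptions, not the actual fix: the Backoff type, retry_with_backoff, and the is_transient predicate are hypothetical names, and the real signature of merge_cluster_and_promote is not shown in this issue. The sketch uses a blocking sleep to stay standard-library only; inside the tokio background task an async sleep would be used instead.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical retry policy; the values a caller picks are illustrative.
pub struct Backoff {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

impl Backoff {
    /// Delay before retry number `retry` (0-based): base * 2^retry, capped at max_delay.
    pub fn delay_for(&self, retry: u32) -> Duration {
        let factor = 2u32.saturating_pow(retry);
        self.base_delay.saturating_mul(factor).min(self.max_delay)
    }
}

/// Run `op` until it succeeds, fails with a non-transient error,
/// or `max_attempts` total attempts have been made.
pub fn retry_with_backoff<T, E>(
    policy: &Backoff,
    mut op: impl FnMut() -> Result<T, E>,
    is_transient: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retry only transient errors, and only while attempts remain.
            Err(e) if attempt + 1 < policy.max_attempts && is_transient(&e) => {
                sleep(policy.delay_for(attempt));
                attempt += 1;
            }
            // Permanent error, or retries exhausted: surface it to the caller.
            Err(e) => return Err(e),
        }
    }
}
```

In this shape, the merge call would be passed as `op`, and `is_transient` would return true only for the "database is locked" case, so other database errors still fail fast and are counted as merge failures.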
Config
- testing.toml, file DB (not in-memory)
- graph memory enabled
- tier promotion background task active