
research: task-continuation metric for post-compaction validation #1609

@bug-ops

Description

Research Finding

Source: Factory.ai evaluation framework (2025), ICLR 2025
Applicability: Medium | Complexity: Simple

Problem

Zeph performs no validation after context compaction. If summarization drops a critical fact (a file path, a decision, the location of an API key), the next LLM call silently operates on incomplete context. This is the root cause of a whole class of bugs where agents 'forget' state after long sessions.

Proposed Approach

After each summarization event, run a lightweight 'compaction probe':

  1. Generate 2-3 factual questions from the original turns that are about to be summarized (e.g., 'What file was modified?', 'What was the user's goal?')
  2. Inject the new summary as context and ask the questions
  3. Score answers against expected (stored before compaction)
  4. If probe score < threshold: log WARN with question/answer pairs, optionally fall back to keeping original turns

The probe adds one extra LLM call per compaction event; the cost is mitigated by routing it to a cheap, fast model via the existing orchestrator.
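The scoring step (step 3) can be sketched in Rust. This is a hypothetical illustration, not the actual `zeph-memory` API: `score_probe`, `answer_overlap`, and the `CompactionScore` shape are assumed names, and a real implementation would likely score answers with an LLM judge or embedding similarity rather than the naive keyword overlap used here.

```rust
/// Result of probing a summary against pre-compaction expected answers
/// (hypothetical shape, mirroring the `CompactionScore` named below).
pub struct CompactionScore {
    pub score: f32,
    /// (question, answer-from-summary) pairs that fell below threshold,
    /// suitable for logging at WARN level.
    pub failed: Vec<(String, String)>,
}

/// Naive stand-in scorer: fraction of expected-answer words that appear
/// (case-insensitively) in the answer produced from the summary.
fn answer_overlap(expected: &str, actual: &str) -> f32 {
    let actual_lc = actual.to_lowercase();
    let words: Vec<&str> = expected.split_whitespace().collect();
    if words.is_empty() {
        return 1.0;
    }
    let hits = words
        .iter()
        .filter(|w| actual_lc.contains(w.to_lowercase().as_str()))
        .count();
    hits as f32 / words.len() as f32
}

/// Score each (question, expected answer, answer-from-summary) triple and
/// collect the pairs that fall below `threshold`.
pub fn score_probe(qa: &[(String, String, String)], threshold: f32) -> CompactionScore {
    let mut total = 0.0;
    let mut failed = Vec::new();
    for (question, expected, actual) in qa {
        let s = answer_overlap(expected, actual);
        total += s;
        if s < threshold {
            failed.push((question.clone(), actual.clone()));
        }
    }
    let score = if qa.is_empty() { 1.0 } else { total / qa.len() as f32 };
    CompactionScore { score, failed }
}

fn main() {
    let qa = vec![
        (
            "What file was modified?".to_string(),
            "src/main.rs".to_string(),     // stored before compaction
            "src/main.rs".to_string(),     // answered from the summary
        ),
        (
            "What was the user's goal?".to_string(),
            "fix the login bug".to_string(),
            "unclear".to_string(),         // the summary lost this fact
        ),
    ];
    let result = score_probe(&qa, 0.7);
    println!("score = {:.2}, failed = {}", result.score, result.failed.len());
    // → score = 0.50, failed = 1
}
```

On a sub-threshold score, the caller would log the `failed` pairs at WARN and, if configured, keep the original turns instead of the summary.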

Integration Points

  • crates/zeph-memory: validate_compaction(before_messages, summary) -> CompactionScore
  • [memory.compression] config: probe_enabled = false, probe_model (defaults to summary model), probe_threshold = 0.7
  • Debug dump: include compaction_probe section with questions, answers, score
  • CLI: no user-facing change (background validation)
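The `[memory.compression]` keys listed above might look like the following in zeph's config file. This is a sketch under the assumptions in this issue: key names come from the list, and the commented defaults are proposals, not existing behavior.

```toml
[memory.compression]
probe_enabled = false      # probe is opt-in; off by default
# probe_model = "…"        # defaults to the summary model when unset
probe_threshold = 0.7      # below this: log WARN, optionally keep original turns
```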

Labels

  • P1: High ROI, low complexity (do next sprint)
  • enhancement: New feature or request
  • memory: zeph-memory crate (SQLite)
  • research: Research-driven improvement
