
Context Engineering

Zeph’s context engineering pipeline manages how information flows into the LLM context window. It combines semantic recall, proportional budget allocation, message trimming, environment injection, tool output management, and runtime compaction into a unified system.

All context engineering features are disabled by default (context_budget_tokens = 0). Set a non-zero budget or enable auto_budget = true to activate the pipeline.

Configuration

[memory]
context_budget_tokens = 128000    # Set to your model's context window size (0 = unlimited)
soft_compaction_threshold = 0.60  # Soft tier: prune tool outputs + apply deferred summaries (no LLM)
hard_compaction_threshold = 0.90  # Hard tier: full LLM summarization when usage exceeds this fraction
compaction_preserve_tail = 4      # Keep last N messages during compaction
prune_protect_tokens = 40000      # Protect recent N tokens from Tier 1 tool output pruning
cross_session_score_threshold = 0.35  # Minimum relevance for cross-session results (0.0-1.0)
tool_call_cutoff = 6              # Summarize oldest tool pair when visible pairs exceed this

[memory.semantic]
enabled = true                    # Required for semantic recall
recall_limit = 5                  # Max semantically relevant messages to inject

[memory.routing]
strategy = "heuristic"            # Query-aware memory backend selection

[memory.compression]
strategy = "proactive"            # "reactive" (default) or "proactive"
threshold_tokens = 80000          # Proactive: fire when context exceeds this (>= 1000)
max_summary_tokens = 4000         # Proactive: summary cap (>= 128)

[tools]
summarize_output = false          # Enable LLM-based tool output summarization

Context Window Layout

When context_budget_tokens > 0, the context window is structured as:

┌─────────────────────────────────────────────────┐
│ BASE_PROMPT (identity + guidelines + security)  │  ~300 tokens
├─────────────────────────────────────────────────┤
│ <environment> cwd, git branch, os, model        │  ~50 tokens
├─────────────────────────────────────────────────┤
│ <project_context> ZEPH.md contents              │  0-500 tokens
├─────────────────────────────────────────────────┤
│ <repo_map> structural overview (if index on)    │  0-1024 tokens
├─────────────────────────────────────────────────┤
│ <available_skills> matched skills (full body)   │  200-2000 tokens
│ <other_skills> remaining (description-only)     │  50-200 tokens
├─────────────────────────────────────────────────┤
│ [knowledge graph] entity facts (if graph on)    │  3% of available
├─────────────────────────────────────────────────┤
│ <code_context> RAG chunks (if index on)         │  30% of available
├─────────────────────────────────────────────────┤
│ [semantic recall] relevant past messages        │  5-8% of available
├─────────────────────────────────────────────────┤
│ [known facts] graph entity-relationship facts   │  0-4% of available
├─────────────────────────────────────────────────┤
│ [compaction summary] if compacted               │  200-500 tokens
├─────────────────────────────────────────────────┤
│ Recent message history                          │  50-60% of available
├─────────────────────────────────────────────────┤
│ [reserved for response generation]              │  20% of total
└─────────────────────────────────────────────────┘

Parallel Context Preparation

Context sources (summaries, cross-session recall, semantic recall, code RAG) are fetched concurrently via tokio::try_join!, reducing context build latency to the slowest single source rather than the sum of all.

Proportional Budget Allocation

Available tokens (after reserving 20% for response) are split proportionally. When code indexing is enabled, the code context slot takes a share from summaries, recall, and history. When graph memory is enabled, an additional 4% is allocated for graph facts, reducing summaries, semantic recall, cross-session, and code context by 1% each:

| Allocation | Without code index | With code index | With graph memory | Purpose |
|---|---|---|---|---|
| Summaries | 15% | 8% | 7% | Conversation summaries from SQLite |
| Semantic recall | 25% | 8% | 7% | Relevant messages from past conversations via Qdrant |
| Cross-session | n/a | 4% | 3% | Messages from other conversations |
| Code context | n/a | 30% | 29% | Retrieved code chunks from project index |
| Graph facts | n/a | n/a | 4% | Entity-relationship facts from graph memory |
| Recent history | 60% | 50% | 50% | Most recent messages in current conversation |

Note: The “With graph memory” column assumes code indexing is also enabled. Graph facts receive 0 tokens when the graph-memory feature is disabled or [memory.graph] enabled = false.
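For a concrete sense of the arithmetic, the split can be computed directly from the table. This is an illustrative sketch (the function name and return shape are assumptions, not Zeph's API):

```rust
/// Illustrative sketch of the proportional split. Percentages come from the
/// allocation table above; names are assumptions, not Zeph's actual API.
/// Returns (available_tokens, [summaries, semantic, cross_session, code,
/// graph_facts, history]).
fn allocations(budget: usize, code_index: bool, graph: bool) -> (usize, [usize; 6]) {
    let available = budget - budget * 20 / 100; // reserve 20% for the response
    let pct: [usize; 6] = if graph {
        [7, 7, 3, 29, 4, 50] // graph memory (code indexing also enabled)
    } else if code_index {
        [8, 8, 4, 30, 0, 50]
    } else {
        [15, 25, 0, 0, 0, 60]
    };
    let mut out = [0usize; 6];
    for (slot, p) in out.iter_mut().zip(pct.iter()) {
        *slot = available * p / 100;
    }
    (available, out)
}

fn main() {
    // 128k budget, no code index: 102,400 available, 61,440 for history.
    let (available, a) = allocations(128_000, false, false);
    println!("available={available}, history={}", a[5]);
}
```

With a 128,000-token budget and no code index, 102,400 tokens remain after the 20% response reserve, of which 61,440 (60%) go to recent history.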

Semantic Recall Injection

When semantic memory is enabled, the agent queries the vector backend for messages relevant to the current user query. Two optional post-processing stages improve result quality:

  • Temporal decay — exponential score attenuation based on message age. Configure via memory.semantic.temporal_decay_enabled and temporal_decay_half_life_days (default: 30).
  • MMR re-ranking — Maximal Marginal Relevance diversifies results by penalizing similarity to already-selected items. Configure via memory.semantic.mmr_enabled and mmr_lambda (default: 0.7, range 0.0-1.0).
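The decay stage can be sketched as half-life attenuation. The exact formula is an assumption consistent with the half-life parameter, not confirmed by the source:

```rust
/// Half-life attenuation: score * 0.5^(age / half_life). The formula is an
/// assumption consistent with temporal_decay_half_life_days, not Zeph's
/// exact implementation.
fn decayed_score(score: f64, age_days: f64, half_life_days: f64) -> f64 {
    score * 0.5f64.powf(age_days / half_life_days)
}

fn main() {
    // Default 30-day half-life: a 30-day-old match keeps half its score.
    println!("{}", decayed_score(0.8, 30.0, 30.0));
}
```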

Results are injected as transient system messages (prefixed with [semantic recall]) that are:

  • Removed and re-injected on every turn (never stale)
  • Not persisted to SQLite
  • Bounded by the allocated token budget (25%, or 10% when code indexing is enabled)

Requires Qdrant and memory.semantic.enabled = true.

Message History Trimming

When recent messages exceed the 60% budget allocation, the oldest non-system messages are evicted. The system prompt and most recent messages are always preserved.
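Trimming can be sketched as oldest-first eviction over per-message token counts; the message type here is illustrative, not Zeph's own:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role { System, User, Assistant }

/// Sketch of oldest-first eviction: drop the oldest non-system message until
/// the history fits the budget. The (Role, token_count) tuple is illustrative.
fn trim(messages: &mut Vec<(Role, usize)>, budget: usize) {
    while messages.iter().map(|(_, t)| *t).sum::<usize>() > budget {
        match messages.iter().position(|(r, _)| *r != Role::System) {
            Some(i) => { messages.remove(i); } // evict oldest non-system message
            None => break, // only system messages remain; nothing to evict
        }
    }
}

fn main() {
    let mut m = vec![(Role::System, 10), (Role::User, 50),
                     (Role::Assistant, 50), (Role::User, 20)];
    trim(&mut m, 80); // evicts the oldest user message; system + recent survive
    println!("{} messages remain", m.len());
}
```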

Environment Context

Every system prompt rebuild injects an <environment> block with:

  • Working directory
  • OS (linux, macos, windows)
  • Current git branch (if in a git repo)
  • Active model name

EnvironmentContext is built once at agent bootstrap and cached. On skill hot-reload, only git_branch and model_name are refreshed. This avoids spawning a git subprocess on every agent turn.

Tool-Pair Summarization

After each tool execution, maybe_summarize_tool_pair() checks whether the number of unsummarized tool call/response pairs exceeds tool_call_cutoff (default: 6). When the threshold is exceeded, the oldest eligible pair is summarized via LLM and the result is stored as a deferred summary. Summaries are applied lazily when context usage exceeds soft_compaction_threshold (default: 0.60), preserving the message prefix for API cache hits.

How It Works

  1. count_unsummarized_pairs() scans for consecutive Assistant(ToolUse) + User(ToolResult/ToolOutput) pairs where both have agent_visible = true and no deferred_summary is pending.
  2. If the count exceeds tool_call_cutoff, find_oldest_unsummarized_pair() locates the first eligible pair (skipping pairs with pruned content).
  3. build_tool_pair_summary_prompt() constructs a prompt with XML-delimited sections (<tool_request> and <tool_response>) to prevent content injection.
  4. The summary provider generates a 1-2 sentence summary capturing tool name, key parameters, and outcome.
  5. The summary is stored in messages[resp_idx].metadata.deferred_summary — the original messages remain visible.
  6. When context usage exceeds soft_compaction_threshold, apply_deferred_summaries() batch-applies all pending summaries: hides the original pairs and inserts Assistant Summary messages.
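Step 1 can be sketched as a window scan over adjacent messages; the field names below are assumptions, not Zeph's actual message type:

```rust
/// Illustrative message shape; field names are assumptions, not Zeph's type.
struct Msg {
    is_tool_use: bool,
    is_tool_result: bool,
    agent_visible: bool,
    deferred_summary: Option<String>,
}

/// Step 1 above: count adjacent tool-use/tool-result pairs that are still
/// agent-visible and have no pending deferred summary.
fn count_unsummarized_pairs(msgs: &[Msg]) -> usize {
    msgs.windows(2)
        .filter(|w| {
            w[0].is_tool_use
                && w[1].is_tool_result
                && w[0].agent_visible
                && w[1].agent_visible
                && w[1].deferred_summary.is_none()
        })
        .count()
}

fn main() {
    let msgs = vec![
        Msg { is_tool_use: true, is_tool_result: false, agent_visible: true, deferred_summary: None },
        Msg { is_tool_use: false, is_tool_result: true, agent_visible: true, deferred_summary: None },
    ];
    // One eligible pair; exceeding tool_call_cutoff would trigger summarization.
    println!("{}", count_unsummarized_pairs(&msgs));
}
```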

Visibility After Summarization

| Message | agent_visible | user_visible | Appears in |
|---|---|---|---|
| Original tool request | false | true | UI only |
| Original tool response | false | true | UI only |
| [tool summary] message | true | false | LLM context only |

Summarization runs synchronously between tool iterations. If the LLM call fails, the error is logged and the pair is left unsummarized.

Summary Provider Configuration

By default, tool-pair summarization uses the primary LLM provider. You can dedicate a faster or cheaper model to this task using either the structured [llm.summary_provider] section or the summary_model string shorthand.

[llm.summary_provider] uses the same struct as [[llm.providers]] entries:

# Claude — model falls back to the claude provider entry when omitted
[llm.summary_provider]
type = "claude"
model = "claude-haiku-4-5-20251001"

# OpenAI — model/base_url fall back to the openai provider entry when omitted
[llm.summary_provider]
type = "openai"
model = "gpt-4o-mini"

# Ollama — model/base_url fall back to [llm] when omitted
[llm.summary_provider]
type = "ollama"
model = "qwen3:1.7b"
base_url = "http://localhost:11434"

# OpenAI-compatible server — `model` is the entry name in [[llm.providers]]
[[llm.providers]]
name = "lm-studio"
type = "compatible"
base_url = "http://localhost:8080/v1"
model = "llama-3.2-1b"

[llm.summary_provider]
type = "compatible"
model = "lm-studio"   # matches [[llm.providers]] name field

# Local candle inference (requires candle feature)
[llm.summary_provider]
type = "candle"
model = "mistral-7b-instruct"   # HuggingFace repo_id; overrides [llm.candle]
device = "metal"                 # "cpu", "cuda", or "metal"; overrides [llm.candle].device

Fields:

| Field | Required | Description |
|---|---|---|
| type | yes | claude, openai, compatible, ollama, or candle |
| model | no | Model name override (for compatible: the [[llm.providers]] entry name) |
| base_url | no | Override endpoint URL (ollama and openai only) |
| embedding_model | no | Override embedding model (ollama and openai only) |
| device | no | Inference device: cpu, cuda, metal (candle only) |

String shorthand (summary_model)

summary_model accepts a compact provider/model string. [llm.summary_provider] takes precedence when both are set.

[llm]
summary_model = "claude"                              # Claude with model from the claude provider entry
summary_model = "claude/claude-haiku-4-5-20251001"   # Claude with explicit model
summary_model = "openai"                              # OpenAI with model from the openai provider entry
summary_model = "openai/gpt-4o-mini"                 # OpenAI with explicit model
summary_model = "compatible/my-server"               # OpenAI-compatible using [[llm.providers]] name
summary_model = "ollama/qwen3:1.7b"                  # Ollama with explicit model
summary_model = "candle"                              # Local candle inference

Query-Aware Memory Routing

When semantic memory is enabled, the MemoryRouter trait decides which backend(s) to query for each recall request. The default HeuristicRouter classifies queries based on lexical cues:

  • Keyword (SQLite FTS5 only) — code patterns (::, /), pure snake_case identifiers, short queries (<=3 words without question words)
  • Semantic (Qdrant vectors only) — natural language questions (what, how, why, …), long queries (>=6 words)
  • Hybrid (both + reciprocal rank fusion) — medium-length queries without clear signals
  • Graph (graph store + hybrid fallback) — relationship patterns (related to, opinion on, connection between, know about). Triggers graph_recall BFS traversal in addition to hybrid message recall. Requires the graph-memory feature; falls back to Hybrid when disabled

Relationship patterns take priority over all other heuristics.
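The rules above can be sketched as a classifier over lexical cues. This is a simplified approximation; the real HeuristicRouter covers more patterns (e.g. snake_case detection):

```rust
#[derive(Debug, PartialEq)]
enum Route { Keyword, Semantic, Hybrid, Graph }

/// Simplified approximation of the lexical cues listed above.
fn route(query: &str) -> Route {
    let q = query.to_lowercase();
    // Relationship patterns take priority over all other heuristics.
    for pat in ["related to", "opinion on", "connection between", "know about"] {
        if q.contains(pat) {
            return Route::Graph;
        }
    }
    let words: Vec<&str> = q.split_whitespace().collect();
    let question = ["what", "how", "why", "when", "where", "who"]
        .iter()
        .any(|w| words.first() == Some(w));
    if q.contains("::") || q.contains('/') || (words.len() <= 3 && !question) {
        return Route::Keyword; // code patterns or short non-question queries
    }
    if question || words.len() >= 6 {
        return Route::Semantic; // natural-language questions, long queries
    }
    Route::Hybrid // medium-length queries without clear signals
}

fn main() {
    println!("{:?}", route("how does the soft compaction tier work"));
}
```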

Configure via [memory.routing]:

[memory.routing]
strategy = "heuristic"   # Only option currently; selected by default

When Qdrant is unavailable, Semantic-route queries return empty results and Hybrid-route queries fall back to FTS5 only.

Proactive Context Compression

By default, context compression is reactive — it fires only when the two-tier pruning pipeline detects threshold overflow. Proactive compression fires earlier, based on an absolute token count threshold, to prevent overflow altogether.

[memory.compression]
strategy = "proactive"
threshold_tokens = 80000       # Compress when context exceeds this (>= 1000)
max_summary_tokens = 4000      # Cap for the compressed summary (>= 128)

Proactive compression runs at the start of the context management phase, before reactive compaction. If proactive compression fires, reactive compaction is skipped for that turn (mutual exclusion via compacted_this_turn flag, reset each turn).

Metrics: compression_events (count), compression_tokens_saved (cumulative tokens freed).

Failure-Driven Compression Guidelines

Zeph can learn from its own compaction mistakes using the ACON (Adaptive COmpaction with Notes) mechanism. When [memory.compression_guidelines] is enabled:

  1. After each hard compaction event, the agent opens a detection window spanning detection_window_turns turns.
  2. Within that window, every LLM response is scanned for a two-signal pattern: an uncertainty phrase (e.g. “I don’t recall”, “I’m not sure”) and a prior-context reference (e.g. “earlier you mentioned”, “we discussed”). Both signals must appear together — this two-signal requirement reduces false positives.
  3. Confirmed failure pairs (compressed context snapshot + failure reason) are stored in compression_failure_pairs in SQLite.
  4. A background task wakes every update_interval_secs seconds. When the count of unprocessed pairs reaches update_threshold, it calls the LLM with a synthesis prompt that includes the current guidelines and the new failure pairs.
  5. The LLM produces an updated numbered list of preservation rules. The output is sanitized (prompt injection patterns stripped, length bounded by max_guidelines_tokens), then stored atomically using a single INSERT ... SELECT COALESCE(MAX(version), 0) + 1 statement that eliminates TOCTOU version conflicts.
  6. Every subsequent compaction injects the active guidelines inside a <compression-guidelines> block, steering the summarizer to preserve previously-lost information categories.
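The two-signal detector in step 2 can be sketched as a pair of substring scans. Phrase lists here are abbreviated examples from the text; real matching may normalize apostrophes and cover more phrases:

```rust
/// Two-signal detector from step 2: both an uncertainty phrase and a
/// prior-context reference must appear. Phrase lists are abbreviated.
fn looks_like_context_loss(response: &str) -> bool {
    let r = response.to_lowercase();
    let uncertainty = ["i don't recall", "i'm not sure"]
        .iter()
        .any(|p| r.contains(p));
    let prior_ref = ["earlier you mentioned", "we discussed"]
        .iter()
        .any(|p| r.contains(p));
    uncertainty && prior_ref // a single signal alone is treated as a false positive
}

fn main() {
    let reply = "I'm not sure, but earlier you mentioned a staging config.";
    println!("{}", looks_like_context_loss(reply));
}
```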

Configuration:

[memory.compression_guidelines]
enabled = true
update_threshold = 5             # Failure pairs needed to trigger a guidelines update (default: 5)
max_guidelines_tokens = 500      # Token budget for the synthesized guidelines (default: 500)
max_pairs_per_update = 10        # Pairs consumed per update cycle (default: 10)
detection_window_turns = 10      # Turns to watch for context loss after hard compaction (default: 10)
update_interval_secs = 300       # Background updater interval in seconds (default: 300)
max_stored_pairs = 100           # Cleanup threshold for stored failure pairs (default: 100)

The feature is opt-in (enabled = false by default). When disabled, compression prompts are unchanged and no failure pairs are recorded. Guidelines accumulate incrementally across sessions — the agent improves its compression behavior over time.

Two-Tier Reactive Compaction

When context usage crosses predefined thresholds, a two-tier compaction strategy activates. Each tier is cheaper than the next. Tier 0 (eager deferred summaries) runs continuously during tool loops independently of these tiers.

Soft Tier: Apply Deferred Summaries + Prune Tool Outputs (at soft_compaction_threshold)

When context usage exceeds soft_compaction_threshold (default: 0.60), Zeph first batch-applies all pending deferred summaries (in-memory, no LLM call), then prunes tool outputs outside the protected tail. This tier does not prevent the hard tier from firing in the same turn.

The soft tier also fires mid-iteration inside tool execution loops (via maybe_soft_compact_mid_iteration()), after summarization and stale pruning. This prevents large tool outputs from pushing context past the hard threshold within a single LLM turn without touching turn counters or cooldown.

Why lazy application? Tool pair summaries are computed eagerly (right after each tool call) but their application to the message array is deferred. As long as context usage stays below 0.60, the original tool call/response messages remain in the array unchanged. This keeps the message prefix stable across consecutive turns, which is the key requirement for the Claude API prompt cache to produce hits.

Hard Tier: Selective Tool Output Pruning + LLM Compaction (at hard_compaction_threshold)

When context usage exceeds hard_compaction_threshold (default: 0.90), Zeph applies deferred summaries, prunes tool outputs, and — if pruning is insufficient — falls back to full LLM-based chunked compaction. Once hard compaction fires, it sets compacted_this_turn to prevent double LLM summarization.

Zeph scans messages outside the protected tail for ToolOutput parts and replaces their content with a short placeholder. This is a cheap, synchronous operation that often frees enough tokens to stay under the threshold without an LLM call.

  • Only tool outputs in messages older than the protected tail are pruned
  • The most recent prune_protect_tokens tokens (default: 40,000) worth of messages are never pruned, preserving recent tool context
  • Pruned parts have their compacted_at timestamp set, body is cleared from memory to reclaim heap, and they are not pruned again
  • Pruned parts are persisted to SQLite before clearing, so pruning state survives session restarts
  • The tool_output_prunes metric tracks how many parts were pruned

Chunked LLM Compaction (Hard Tier Fallback)

If Tier 1 does not free enough tokens, adaptive chunked compaction runs:

  1. Middle messages (between system prompt and last N recent) are split into ~4096-token chunks
  2. Chunks are summarized in parallel via futures::stream::buffer_unordered(4) — up to 4 concurrent LLM calls
  3. Partial summaries are merged into a final summary via a second LLM pass
  4. replace_conversation() atomically updates the compacted range and inserts the summary in SQLite
  5. Last compaction_preserve_tail messages (default: 4) are always preserved

If all messages fit in a single chunk, or if chunked summarization fails, the system falls back to single-pass summarization over the full message range.

Both tiers are idempotent and run automatically during the agent loop.

Post-Compression Validation (Compaction Probe)

After hard-tier LLM compaction produces a candidate summary, an optional validation step can verify that the summary preserves critical facts before committing it. The compaction probe generates factual questions from the original messages, answers them using only the summary, and scores the answers. The probe runs only during hard-tier compaction events — soft-tier pruning and deferred summaries are not validated.

The feature is disabled by default ([memory.compression.probe] enabled = false). On errors or timeouts, the probe fails open — compaction proceeds without validation.

How It Works

  1. After summarize_messages() produces a summary, the probe generates up to max_questions factual questions from the original messages. Tool output bodies are truncated to 500 characters to focus on decisions and outcomes.
  2. Questions target concrete details: file paths, function/struct names, architectural decisions, config values, error messages, and action items.
  3. A second LLM call answers the questions using ONLY the summary text. If information is absent from the summary, the model answers “UNKNOWN”.
  4. Answers are scored against expected values using token-set-ratio similarity (Jaccard-based with substring boost). Refusal patterns (“unknown”, “not mentioned”, “n/a”, etc.) score 0.0.
  5. The average score determines the verdict.

If the probe generates fewer than 2 questions (e.g., very short conversations with insufficient factual content), the probe is skipped and compaction proceeds without validation.
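The scoring in step 4 can be sketched as token-set Jaccard with a substring boost. The refusal list, the 0.9 boost floor, and the normalization are illustrative assumptions:

```rust
use std::collections::HashSet;

/// Token-set Jaccard with a substring boost and refusal zeroing. Constants
/// and refusal phrases are illustrative assumptions.
fn score_answer(expected: &str, actual: &str) -> f64 {
    let e = expected.to_lowercase();
    let a = actual.to_lowercase();
    for refusal in ["unknown", "not mentioned", "n/a"] {
        if a.trim() == refusal {
            return 0.0; // refusal patterns score 0.0
        }
    }
    let toks = |s: &str| s.split_whitespace().map(str::to_owned).collect::<HashSet<String>>();
    let (te, ta) = (toks(&e), toks(&a));
    let union = te.union(&ta).count();
    let jaccard = if union == 0 {
        0.0
    } else {
        te.intersection(&ta).count() as f64 / union as f64
    };
    // Substring boost: containment of the expected answer scores high.
    if a.contains(&e) || e.contains(&a) { jaccard.max(0.9) } else { jaccard }
}

fn main() {
    println!("{}", score_answer("crates/zeph-core/src/auth.rs",
                                "The file crates/zeph-core/src/auth.rs was modified"));
}
```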

Verdict Behavior

| Verdict | Score range (defaults) | Action | Metric incremented |
|---|---|---|---|
| Pass | >= 0.60 | Commit summary | compaction_probe_passes |
| SoftFail | [0.35, 0.60) | Commit summary + WARN log | compaction_probe_soft_failures |
| HardFail | < 0.35 | Block compaction, preserve original messages | compaction_probe_failures |
| Error | N/A (LLM/timeout) | Non-blocking, proceed with compaction | compaction_probe_errors |

When HardFail blocks compaction, the outcome is ProbeRejected. This sets an internal cooldown but does NOT trigger the Exhausted state — the compactor can retry on a later turn with new messages.

User-Facing Messages

  • During probe: status indicator shows “Validating compaction quality…”
  • HardFail (via /compact): “Compaction rejected: summary quality below threshold. Original context preserved.”
  • SoftFail: warning in logs only; user sees normal “Context compacted successfully.”
  • Pass: normal “Context compacted successfully.”

Configuration

[memory.compression.probe]
enabled = false           # Enable compaction probe validation (default: false)
model = ""                # Model for probe LLM calls (empty = summary provider)
threshold = 0.6           # Minimum score to pass without warnings
hard_fail_threshold = 0.35 # Score below this blocks compaction (HardFail)
max_questions = 3         # Maximum factual questions per probe
timeout_secs = 15         # Timeout for the entire probe (both LLM calls)

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable probe validation after each hard compaction |
| model | string | "" | Model override for probe LLM calls. Empty = use summary provider. Non-Haiku models increase cost (~10x) |
| threshold | float | 0.6 | Minimum average score for Pass verdict |
| hard_fail_threshold | float | 0.35 | Score below this triggers HardFail (blocks compaction) |
| max_questions | integer | 3 | Number of factual questions generated per probe |
| timeout_secs | integer | 15 | Timeout for both LLM calls combined |

Threshold tuning:

  • Decrease threshold to 0.45-0.50 for creative or conversational sessions where verbatim detail preservation matters less.
  • Raise threshold to 0.75-0.80 for coding sessions where file paths and architectural decisions must survive compaction.
  • Keep a gap of at least 0.15-0.20 between hard_fail_threshold and threshold to maintain a meaningful SoftFail range.
  • max_questions = 3 balances probe accuracy against latency and cost. Increase to 5 for higher statistical power at the expense of slower probes.

Debug Dump Output

When debug dump is enabled, each probe writes a {id:04}-compaction-probe.json file with the full probe result:

{
  "score": 0.75,
  "threshold": 0.6,
  "hard_fail_threshold": 0.35,
  "verdict": "Pass",
  "model": "claude-haiku-4-5-20251001",
  "duration_ms": 2340,
  "questions": [
    {
      "question": "What file was modified to fix the auth bug?",
      "expected": "crates/zeph-core/src/auth.rs",
      "actual": "The file crates/zeph-core/src/auth.rs was modified",
      "score": 1.0
    }
  ]
}

The questions array merges question text, expected answer, actual LLM answer, and per-question score into a single object per question for easy inspection.

Troubleshooting

Frequent HardFail verdicts

  • The summary model may be too small for the conversation complexity. Try a larger model via model = "claude-sonnet-4-5-20250514" (higher cost).
  • Lower hard_fail_threshold if false negatives are common (probe is too strict).
  • Increase max_questions to 5 for more statistical power (increases latency).

Probe always returns SoftFail

  • Check debug dump: if per-question scores show one strong and one weak answer, the summary may be partially lossy. This is expected behavior — SoftFail means “good enough” and does not block compaction.
  • Consider enabling Failure-Driven Compression Guidelines to teach the summarizer what to preserve.

Probe timeout warnings

  • Default 15s should be sufficient for most models. Increase timeout_secs for slow providers (e.g., local Ollama with large models).
  • On timeout, compaction proceeds without validation (fail-open).

Performance considerations

  • Each probe makes 2 LLM calls (question generation + answer verification).
  • With Haiku: ~$0.001-0.003 per probe, 1-3 seconds latency.
  • With Sonnet: ~$0.01-0.03 per probe, 2-5 seconds latency.
  • Probes run only during hard compaction events, not on every turn.
  • The probe timeout does not affect the main agent loop — it only gates whether the compaction summary is committed.

Metrics

| Metric | Description |
|---|---|
| compaction_probe_passes | Total Pass verdicts |
| compaction_probe_soft_failures | Total SoftFail verdicts |
| compaction_probe_failures | Total HardFail verdicts (compaction blocked) |
| compaction_probe_errors | Total Error verdicts (LLM/timeout, non-blocking) |
| last_probe_verdict | Most recent verdict (Pass/SoftFail/HardFail/Error) |
| last_probe_score | Most recent probe score in [0.0, 1.0] |

Compaction Loop Prevention

maybe_compact() tracks whether compaction is making progress. The compaction_exhausted flag is set permanently when any of the following conditions are detected after a hard-tier attempt:

  • Fewer than 2 messages are eligible for compaction (nothing useful to summarize).
  • The LLM summary consumes as many tokens as were freed — net reduction is zero.
  • Context usage remains above hard_compaction_threshold even after a successful summarization pass.

Once exhausted, all further compaction calls are skipped for the session. A one-time warning is emitted to the user channel and to the log (WARN level):

Warning: context budget is too tight — compaction cannot free enough space.
Consider increasing [memory] context_budget_tokens or starting a new session.

This prevents infinite compaction loops when the configured budget is smaller than the minimum required for the system prompt and response reservation combined.

Structured Anchored Summarization

When hard compaction fires, the summarizer can produce structured AnchoredSummary objects with five mandatory sections:

| Section | Content |
|---|---|
| session_intent | What the user is trying to accomplish |
| files_modified | File paths, function names, structs touched |
| decisions_made | Architectural decisions with rationale |
| open_questions | Unresolved items or ambiguities |
| next_steps | Concrete actions to take immediately |

Anchored summaries are validated for completeness (session_intent and next_steps must be non-empty) and rendered as Markdown with [anchored summary] headers. This structured format reduces information loss compared to the free-form 9-section prompt below.

Subgoal-Aware Compaction

When task orchestration is active, the SubgoalRegistry tracks which messages belong to each subgoal and their state (Active, Completed, Abandoned). During hard compaction:

  • Messages in active subgoal ranges are preserved unconditionally
  • Messages in completed subgoal ranges are aggressively compacted
  • The registry state is dumped alongside each compaction event when debug dump is enabled ({id:04}-subgoal-registry.txt)

This prevents compaction from destroying the context that an in-progress orchestration task depends on.

Structured Compaction Prompt

Compaction summaries use a 9-section structured prompt designed for self-consumption. The LLM is instructed to produce exactly these sections:

  1. User Intent — what the user is ultimately trying to accomplish
  2. Technical Concepts — key technologies, patterns, constraints discussed
  3. Files & Code — file paths, function names, structs, enums touched or relevant
  4. Errors & Fixes — every error encountered and whether/how it was resolved
  5. Problem Solving — approaches tried, decisions made, alternatives rejected
  6. User Messages — verbatim user requests that are still pending or relevant
  7. Pending Tasks — items explicitly promised or left TODO
  8. Current Work — the exact task in progress at the moment of compaction
  9. Next Step — the single most important action to take immediately after compaction

The prompt favors thoroughness over brevity: longer summaries that preserve actionable detail are preferred over terse ones. When multiple chunks are summarized in parallel, a consolidation pass merges partial summaries into the same 9-section structure.

Progressive Tool Response Removal

When the LLM compaction itself hits a context length error (the messages being compacted are too large for the summarization model), summarize_messages() applies progressive middle-out tool response removal before retrying:

| Tier | Fraction removed | Description |
|---|---|---|
| 1 | 10% | Remove ~10% of tool responses from the center outward |
| 2 | 20% | Increase removal to ~20% |
| 3 | 50% | Remove half of all tool responses |
| 4 | 100% | Remove all tool responses |

The middle-out strategy starts removal from the center of the tool response list and alternates outward toward the edges. This preserves the earliest responses (which establish context) and the most recent ones (which reflect current work), while discarding the middle of the conversation first.
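The ordering can be sketched as follows, assuming removal starts at the exact center and alternates left then right (the real tie-breaking may differ):

```rust
/// Indices to remove for one tier, starting at the center and alternating
/// outward. The center choice and left-then-right order are assumptions.
fn middle_out_indices(len: usize, frac: f64) -> Vec<usize> {
    if len == 0 { return Vec::new(); }
    let n = ((len as f64) * frac).ceil() as usize;
    let mid = len / 2;
    let mut order = vec![mid];
    for step in 1..len {
        if mid >= step { order.push(mid - step); }      // step left of center
        if mid + step < len { order.push(mid + step); } // step right of center
    }
    order.truncate(n.min(len));
    order
}

fn main() {
    // Tier 2 (20%) over 10 tool responses removes 2, starting at the center.
    println!("{:?}", middle_out_indices(10, 0.2));
}
```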

At each tier, ToolResult content is replaced with [compacted] and ToolOutput bodies are cleared (with compacted_at timestamp set). The reduced message set is then retried through the LLM summarization pipeline.

Metadata-Only Fallback

If all LLM summarization attempts fail (including after 100% tool response removal), build_metadata_summary() produces a lightweight summary without any LLM call:

[metadata summary — LLM compaction unavailable]
Messages compacted: 47 (23 user, 22 assistant, 2 system)
Last user message: <first 200 chars of last user message>
Last assistant message: <first 200 chars of last assistant message>

Text previews use safe UTF-8 truncation (truncate_chars()) that never splits a Unicode scalar value. This fallback guarantees that compaction always succeeds, even when the LLM is unreachable or the context is too large for any available model.
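Char-boundary-safe truncation can be sketched with char_indices(); the real truncate_chars() may differ in detail:

```rust
/// Keep at most `max_chars` Unicode scalar values, never splitting one.
/// Sketch of the behavior described above, not Zeph's exact implementation.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx], // cut on a char boundary
        None => s,                             // already short enough
    }
}

fn main() {
    println!("{}", truncate_chars("héllo wörld", 5));
}
```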

Reactive Retry on Context Length Errors

LLM calls in the agent loop (call_llm_with_retry() and call_chat_with_tools_retry()) intercept context length errors and automatically compact before retrying. The flow:

  1. Send messages to the LLM provider
  2. If the provider returns a context length error, trigger compact_context()
  3. Retry the LLM call with the compacted context
  4. If the error persists after max_attempts (default: 2), propagate the error

Non-context-length errors (rate limits, network failures, etc.) are propagated immediately without retry.

Context Length Error Detection

LlmError::is_context_length_error() detects context overflow across providers via pattern matching on error messages:

| Provider | Matched patterns |
|---|---|
| Claude | "maximum number of tokens" |
| OpenAI | "maximum context length", "context_length_exceeded" |
| Ollama | "context length exceeded", "prompt is too long", "input too long" |

The dedicated LlmError::ContextLengthExceeded variant is also recognized. This unified detection allows the retry logic to work identically across all supported LLM backends.
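The detection can be sketched as a case-insensitive scan over the patterns in the table (the dedicated error variant is omitted here):

```rust
/// Cross-provider context-length detection over the documented patterns.
/// Sketch only; the real check also matches LlmError::ContextLengthExceeded.
fn is_context_length_error(msg: &str) -> bool {
    const PATTERNS: &[&str] = &[
        "maximum number of tokens",  // Claude
        "maximum context length",    // OpenAI
        "context_length_exceeded",   // OpenAI
        "context length exceeded",   // Ollama
        "prompt is too long",        // Ollama
        "input too long",            // Ollama
    ];
    let m = msg.to_lowercase();
    PATTERNS.iter().any(|p| m.contains(p))
}

fn main() {
    println!("{}", is_context_length_error("prompt is too long for this model"));
}
```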

Dual-Visibility Compaction

Compaction is non-destructive. Each Message carries MessageMetadata with agent_visible and user_visible flags:

| Message state | agent_visible | user_visible | Appears in |
|---|---|---|---|
| Normal | true | true | LLM context + UI |
| Compacted original | false | true | UI only |
| Compaction summary | true | false | LLM context only |

replace_conversation() performs both updates atomically in a single SQLite transaction: it sets agent_visible=0, compacted_at=<timestamp> on the compacted range, then inserts the summary with agent_visible=1, user_visible=0. This guarantees the user retains full scroll-back history while the LLM sees only the compact summary.

Semantic recall (vector + FTS5) filters by agent_visible=1, so compacted originals are excluded from retrieval. Use load_history_filtered(conversation_id, agent_visible, user_visible) to query messages by visibility.

Native compress_context Tool

When the context-compression feature is enabled, Zeph registers a compress_context native tool that the model can invoke explicitly to trigger context compression on demand — without waiting for the automatic threshold-based compaction pipeline to fire.

The tool supports two compression strategies:

| Strategy | Behavior |
|---|---|
| Reactive | Apply pending deferred summaries and prune old tool outputs (no LLM call). Equivalent to a soft-tier compaction triggered on demand. |
| Autonomous | Run full LLM-based chunked compaction immediately, regardless of current token usage. The model decides when to invoke this based on its own assessment of context quality. |

Autonomous mode uses the compress_provider for the summarization call. Configure it in [memory.compression]:

[memory.compression]
compress_provider = "fast"   # Provider name for autonomous compress_context calls

When compress_provider is unset, the default LLM provider is used. The compress_context tool does not appear in the tool catalog when the context-compression feature is disabled at build time.

Invocation:

The model calls the tool with a strategy parameter:

{ "strategy": "Autonomous" }

After execution, the tool returns a summary of tokens freed and the compaction outcome. The result is visible in the chat panel and in the debug dump.

Tool Output Management

Truncation

Tool outputs exceeding 30,000 characters are automatically truncated using a head+tail split with UTF-8 safe boundaries. Both the first and last ~15K chars are preserved.
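The head+tail split can be sketched as below; the marker text is illustrative, and working on chars keeps the cuts on UTF-8 safe boundaries as described:

```rust
/// Keep ~half the limit from each end of an over-long tool output.
/// Sketch of the behavior described above; marker text is illustrative.
fn truncate_head_tail(s: &str, max_chars: usize) -> String {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() <= max_chars {
        return s.to_string(); // already within the limit
    }
    let half = max_chars / 2;
    let head: String = chars[..half].iter().collect();
    let tail: String = chars[chars.len() - half..].iter().collect();
    format!("{head}\n[... truncated ...]\n{tail}")
}

fn main() {
    let big = "a".repeat(40_000);
    println!("{} chars kept", truncate_head_tail(&big, 30_000).chars().count());
}
```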

Smart Summarization

When tools.summarize_output = true, long tool outputs are sent through the LLM with a prompt that preserves file paths, error messages, and numeric values. If the LLM call fails, summarization falls back to truncation.

export ZEPH_TOOLS_SUMMARIZE_OUTPUT=true

Skill Prompt Modes

The skills.prompt_mode setting controls how matched skills are rendered in the system prompt:

| Mode | Behavior |
|---|---|
| full | Full XML skill bodies with instructions, examples, and references |
| compact | Condensed XML with name, description, and trigger list only (~80% smaller) |
| auto (default) | Selects compact when the remaining context budget is below 8192 tokens; full otherwise |

[skills]
prompt_mode = "auto"  # "full", "compact", or "auto"

compact mode is useful for small context windows or when many skills are active. It preserves enough information for the model to select the right skill while minimizing token consumption.

Progressive Skill Loading

Skills matched by embedding similarity (top-K) are injected with their full body (or compact summary, depending on prompt_mode). Remaining skills are listed in a description-only <other_skills> catalog — giving the model awareness of all capabilities while consuming minimal tokens.

ZEPH.md Project Config

Zeph walks up the directory tree from the current working directory looking for:

  • ZEPH.md
  • ZEPH.local.md
  • .zeph/config.md

Found configs are concatenated (global first, then ancestors from root to cwd) and injected into the system prompt as a <project_context> block. Use this to provide project-specific instructions.
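The walk-up discovery can be sketched with Path::ancestors(); the global config location is omitted here, and the ordering mirrors the root-to-cwd concatenation described above:

```rust
use std::path::{Path, PathBuf};

/// Candidate config paths from the filesystem root down to cwd, using the
/// documented file names. Sketch only; global config handling is omitted.
fn config_candidates(cwd: &Path) -> Vec<PathBuf> {
    let names = ["ZEPH.md", "ZEPH.local.md", ".zeph/config.md"];
    let mut dirs: Vec<&Path> = cwd.ancestors().collect(); // cwd, parent, ..., root
    dirs.reverse(); // concatenation order is root first, then down to cwd
    dirs.iter()
        .flat_map(|d| names.iter().map(move |n| d.join(n)))
        .collect()
}

fn main() {
    for p in config_candidates(Path::new("/repo/crates/zeph-core")) {
        println!("{}", p.display());
    }
}
```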

Environment Variables

| Variable | Description | Default |
|---|---|---|
| ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget in tokens | 0 (unlimited) |
| ZEPH_MEMORY_SOFT_COMPACTION_THRESHOLD | Soft compaction threshold: prune tool outputs + apply deferred summaries (no LLM) | 0.60 |
| ZEPH_MEMORY_COMPACTION_THRESHOLD | Hard compaction threshold (backward-compat alias for hard_compaction_threshold) | 0.90 |
| ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction | 4 |
| ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning | 40000 |
| ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory results | 0.35 |
| ZEPH_MEMORY_TOOL_CALL_CUTOFF | Max visible tool pairs before oldest is summarized | 6 |
| ZEPH_MEMORY_SEMANTIC_TEMPORAL_DECAY_ENABLED | Enable temporal decay scoring | false |
| ZEPH_MEMORY_SEMANTIC_TEMPORAL_DECAY_HALF_LIFE_DAYS | Half-life for temporal decay | 30 |
| ZEPH_MEMORY_SEMANTIC_MMR_ENABLED | Enable MMR re-ranking | false |
| ZEPH_MEMORY_SEMANTIC_MMR_LAMBDA | MMR relevance-diversity trade-off | 0.7 |
| ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization | false |