
Context Engineering

Zeph’s context engineering pipeline manages how information flows into the LLM context window. It combines semantic recall, proportional budget allocation, message trimming, environment injection, tool output management, and runtime compaction into a unified system.

All context engineering features are disabled by default (context_budget_tokens = 0). Set a non-zero budget or enable auto_budget = true to activate the pipeline.

Configuration

[memory]
context_budget_tokens = 128000    # Set to your model's context window size (0 = unlimited)
soft_compaction_threshold = 0.60  # Soft tier: prune tool outputs + apply deferred summaries (no LLM)
hard_compaction_threshold = 0.90  # Hard tier: full LLM summarization when usage exceeds this fraction
compaction_preserve_tail = 4      # Keep last N messages during compaction
prune_protect_tokens = 40000      # Protect recent N tokens from Tier 1 tool output pruning
cross_session_score_threshold = 0.35  # Minimum relevance for cross-session results (0.0-1.0)
tool_call_cutoff = 6              # Summarize oldest tool pair when visible pairs exceed this

[memory.semantic]
enabled = true                    # Required for semantic recall
recall_limit = 5                  # Max semantically relevant messages to inject

[memory.routing]
strategy = "heuristic"            # Query-aware memory backend selection

[memory.compression]
strategy = "proactive"            # "reactive" (default) or "proactive"
threshold_tokens = 80000          # Proactive: fire when context exceeds this (>= 1000)
max_summary_tokens = 4000         # Proactive: summary cap (>= 128)

[tools]
summarize_output = false          # Enable LLM-based tool output summarization

Context Window Layout

When context_budget_tokens > 0, the context window is structured as:

┌─────────────────────────────────────────────────┐
│ BASE_PROMPT (identity + guidelines + security)  │  ~300 tokens
├─────────────────────────────────────────────────┤
│ <environment> cwd, git branch, os, model        │  ~50 tokens
├─────────────────────────────────────────────────┤
│ <project_context> ZEPH.md contents              │  0-500 tokens
├─────────────────────────────────────────────────┤
│ <repo_map> structural overview (if index on)    │  0-1024 tokens
├─────────────────────────────────────────────────┤
│ <available_skills> matched skills (full body)   │  200-2000 tokens
│ <other_skills> remaining (description-only)     │  50-200 tokens
├─────────────────────────────────────────────────┤
│ [knowledge graph] entity facts (if graph on)    │  3% of available
├─────────────────────────────────────────────────┤
│ <code_context> RAG chunks (if index on)         │  30% of available
├─────────────────────────────────────────────────┤
│ [semantic recall] relevant past messages        │  5-8% of available
├─────────────────────────────────────────────────┤
│ [known facts] graph entity-relationship facts   │  0-4% of available
├─────────────────────────────────────────────────┤
│ [compaction summary] if compacted               │  200-500 tokens
├─────────────────────────────────────────────────┤
│ Recent message history                          │  50-60% of available
├─────────────────────────────────────────────────┤
│ [reserved for response generation]              │  20% of total
└─────────────────────────────────────────────────┘

Parallel Context Preparation

Context sources (summaries, cross-session recall, semantic recall, code RAG) are fetched concurrently via tokio::try_join!, reducing context build latency to the slowest single source rather than the sum of all.

Proportional Budget Allocation

Available tokens (after reserving 20% for response) are split proportionally. When code indexing is enabled, the code context slot takes a share from summaries, recall, and history. When graph memory is enabled, an additional 4% is allocated for graph facts, reducing summaries, semantic recall, cross-session, and code context by 1% each:

| Allocation | Without code index | With code index | With graph memory | Purpose |
|---|---|---|---|---|
| Summaries | 15% | 8% | 7% | Conversation summaries from SQLite |
| Semantic recall | 25% | 8% | 7% | Relevant messages from past conversations via Qdrant |
| Cross-session | n/a | 4% | 3% | Messages from other conversations |
| Code context | n/a | 30% | 29% | Retrieved code chunks from project index |
| Graph facts | n/a | n/a | 4% | Entity-relationship facts from graph memory |
| Recent history | 60% | 50% | 50% | Most recent messages in current conversation |

Note: The “With graph memory” column assumes code indexing is also enabled. Graph facts receive 0 tokens when the graph-memory feature is disabled or [memory.graph] enabled = false.
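For a concrete sense of the arithmetic, the split can be computed directly from the table. This is an illustrative sketch (the function name and return shape are assumptions, not Zeph's API):

```rust
/// Illustrative sketch of the proportional split. Percentages come from the
/// allocation table above; names are assumptions, not Zeph's actual API.
/// Returns (available_tokens, [summaries, semantic, cross_session, code,
/// graph_facts, history]).
fn allocations(budget: usize, code_index: bool, graph: bool) -> (usize, [usize; 6]) {
    let available = budget - budget * 20 / 100; // reserve 20% for the response
    let pct: [usize; 6] = if graph {
        [7, 7, 3, 29, 4, 50] // graph memory (code indexing also enabled)
    } else if code_index {
        [8, 8, 4, 30, 0, 50]
    } else {
        [15, 25, 0, 0, 0, 60]
    };
    let mut out = [0usize; 6];
    for (slot, p) in out.iter_mut().zip(pct.iter()) {
        *slot = available * p / 100;
    }
    (available, out)
}

fn main() {
    // 128k budget, no code index: 102,400 available, 61,440 for history.
    let (available, a) = allocations(128_000, false, false);
    println!("available={available}, history={}", a[5]);
}
```

With a 128,000-token budget and no code index, 102,400 tokens remain after the 20% response reserve, of which 61,440 (60%) go to recent history.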

Semantic Recall Injection

When semantic memory is enabled, the agent queries the vector backend for messages relevant to the current user query. Two optional post-processing stages improve result quality:

  • Temporal decay — exponential score attenuation based on message age. Configure via memory.semantic.temporal_decay_enabled and temporal_decay_half_life_days (default: 30).
  • MMR re-ranking — Maximal Marginal Relevance diversifies results by penalizing similarity to already-selected items. Configure via memory.semantic.mmr_enabled and mmr_lambda (default: 0.7, range 0.0-1.0).
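The decay stage can be sketched as half-life attenuation. The exact formula is an assumption consistent with the half-life parameter, not confirmed by the source:

```rust
/// Half-life attenuation: score * 0.5^(age / half_life). The formula is an
/// assumption consistent with temporal_decay_half_life_days, not Zeph's
/// exact implementation.
fn decayed_score(score: f64, age_days: f64, half_life_days: f64) -> f64 {
    score * 0.5f64.powf(age_days / half_life_days)
}

fn main() {
    // Default 30-day half-life: a 30-day-old match keeps half its score.
    println!("{}", decayed_score(0.8, 30.0, 30.0));
}
```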

Results are injected as transient system messages (prefixed with [semantic recall]) that are:

  • Removed and re-injected on every turn (never stale)
  • Not persisted to SQLite
  • Bounded by the allocated token budget (25%, or 10% when code indexing is enabled)

Requires Qdrant and memory.semantic.enabled = true.

Message History Trimming

When recent messages exceed the 60% budget allocation, the oldest non-system messages are evicted. The system prompt and most recent messages are always preserved.
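Trimming can be sketched as oldest-first eviction over per-message token counts; the message type here is illustrative, not Zeph's own:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role { System, User, Assistant }

/// Sketch of oldest-first eviction: drop the oldest non-system message until
/// the history fits the budget. The (Role, token_count) tuple is illustrative.
fn trim(messages: &mut Vec<(Role, usize)>, budget: usize) {
    while messages.iter().map(|(_, t)| *t).sum::<usize>() > budget {
        match messages.iter().position(|(r, _)| *r != Role::System) {
            Some(i) => { messages.remove(i); } // evict oldest non-system message
            None => break, // only system messages remain; nothing to evict
        }
    }
}

fn main() {
    let mut m = vec![(Role::System, 10), (Role::User, 50),
                     (Role::Assistant, 50), (Role::User, 20)];
    trim(&mut m, 80); // evicts the oldest user message; system + recent survive
    println!("{} messages remain", m.len());
}
```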

Environment Context

Every system prompt rebuild injects an <environment> block with:

  • Working directory
  • OS (linux, macos, windows)
  • Current git branch (if in a git repo)
  • Active model name

EnvironmentContext is built once at agent bootstrap and cached. On skill hot-reload, only git_branch and model_name are refreshed. This avoids spawning a git subprocess on every agent turn.

Tool-Pair Summarization

After each tool execution, maybe_summarize_tool_pair() checks whether the number of unsummarized tool call/response pairs exceeds tool_call_cutoff (default: 6). When the threshold is exceeded, the oldest eligible pair is summarized via LLM and the result is stored as a deferred summary. Summaries are applied lazily when context usage exceeds soft_compaction_threshold (default: 0.60), preserving the message prefix for API cache hits.

How It Works

  1. count_unsummarized_pairs() scans for consecutive Assistant(ToolUse) + User(ToolResult/ToolOutput) pairs where both have agent_visible = true and no deferred_summary is pending.
  2. If the count exceeds tool_call_cutoff, find_oldest_unsummarized_pair() locates the first eligible pair (skipping pairs with pruned content).
  3. build_tool_pair_summary_prompt() constructs a prompt with XML-delimited sections (<tool_request> and <tool_response>) to prevent content injection.
  4. The summary provider generates a 1-2 sentence summary capturing tool name, key parameters, and outcome.
  5. The summary is stored in messages[resp_idx].metadata.deferred_summary — the original messages remain visible.
  6. When context usage exceeds soft_compaction_threshold, apply_deferred_summaries() batch-applies all pending summaries: hides the original pairs and inserts Assistant Summary messages.
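Step 1 can be sketched as a window scan over adjacent messages; the field names below are assumptions, not Zeph's actual message type:

```rust
/// Illustrative message shape; field names are assumptions, not Zeph's type.
struct Msg {
    is_tool_use: bool,
    is_tool_result: bool,
    agent_visible: bool,
    deferred_summary: Option<String>,
}

/// Step 1 above: count adjacent tool-use/tool-result pairs that are still
/// agent-visible and have no pending deferred summary.
fn count_unsummarized_pairs(msgs: &[Msg]) -> usize {
    msgs.windows(2)
        .filter(|w| {
            w[0].is_tool_use
                && w[1].is_tool_result
                && w[0].agent_visible
                && w[1].agent_visible
                && w[1].deferred_summary.is_none()
        })
        .count()
}

fn main() {
    let msgs = vec![
        Msg { is_tool_use: true, is_tool_result: false, agent_visible: true, deferred_summary: None },
        Msg { is_tool_use: false, is_tool_result: true, agent_visible: true, deferred_summary: None },
    ];
    // One eligible pair; exceeding tool_call_cutoff would trigger summarization.
    println!("{}", count_unsummarized_pairs(&msgs));
}
```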

Visibility After Summarization

| Message | agent_visible | user_visible | Appears in |
|---|---|---|---|
| Original tool request | false | true | UI only |
| Original tool response | false | true | UI only |
| [tool summary] message | true | false | LLM context only |

Summarization runs synchronously between tool iterations. If the LLM call fails, the error is logged and the pair is left unsummarized.

Summary Provider Configuration

By default, tool-pair summarization uses the primary LLM provider. You can dedicate a faster or cheaper model to this task using either the structured [llm.summary_provider] section or the summary_model string shorthand.

[llm.summary_provider] uses the same struct as [[llm.providers]] entries:

# Claude — model falls back to the claude provider entry when omitted
[llm.summary_provider]
type = "claude"
model = "claude-haiku-4-5-20251001"

# OpenAI — model/base_url fall back to the openai provider entry when omitted
[llm.summary_provider]
type = "openai"
model = "gpt-4o-mini"

# Ollama — model/base_url fall back to [llm] when omitted
[llm.summary_provider]
type = "ollama"
model = "qwen3:1.7b"
base_url = "http://localhost:11434"

# OpenAI-compatible server — `model` is the entry name in [[llm.providers]]
[[llm.providers]]
name = "lm-studio"
type = "compatible"
base_url = "http://localhost:8080/v1"
model = "llama-3.2-1b"

[llm.summary_provider]
type = "compatible"
model = "lm-studio"   # matches [[llm.providers]] name field

# Local candle inference (requires candle feature)
[llm.summary_provider]
type = "candle"
model = "mistral-7b-instruct"   # HuggingFace repo_id; overrides [llm.candle]
device = "metal"                 # "cpu", "cuda", or "metal"; overrides [llm.candle].device

Fields:

| Field | Required | Description |
|---|---|---|
| type | yes | claude, openai, compatible, ollama, or candle |
| model | no | Model name override (for compatible: the [[llm.providers]] entry name) |
| base_url | no | Override endpoint URL (ollama and openai only) |
| embedding_model | no | Override embedding model (ollama and openai only) |
| device | no | Inference device: cpu, cuda, metal (candle only) |

String shorthand (summary_model)

summary_model accepts a compact provider/model string. [llm.summary_provider] takes precedence when both are set.

[llm]
summary_model = "claude"                              # Claude with model from the claude provider entry
summary_model = "claude/claude-haiku-4-5-20251001"   # Claude with explicit model
summary_model = "openai"                              # OpenAI with model from the openai provider entry
summary_model = "openai/gpt-4o-mini"                 # OpenAI with explicit model
summary_model = "compatible/my-server"               # OpenAI-compatible using [[llm.providers]] name
summary_model = "ollama/qwen3:1.7b"                  # Ollama with explicit model
summary_model = "candle"                              # Local candle inference

Query-Aware Memory Routing

When semantic memory is enabled, the MemoryRouter trait decides which backend(s) to query for each recall request. The default HeuristicRouter classifies queries based on lexical cues:

  • Keyword (SQLite FTS5 only) — code patterns (::, /), pure snake_case identifiers, short queries (<=3 words without question words)
  • Semantic (Qdrant vectors only) — natural language questions (what, how, why, …), long queries (>=6 words)
  • Hybrid (both + reciprocal rank fusion) — medium-length queries without clear signals
  • Graph (graph store + hybrid fallback) — relationship patterns (related to, opinion on, connection between, know about). Triggers graph_recall BFS traversal in addition to hybrid message recall. Requires the graph-memory feature; falls back to Hybrid when disabled

Relationship patterns take priority over all other heuristics.
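The rules above can be sketched as a classifier over lexical cues. This is a simplified approximation; the real HeuristicRouter covers more patterns (e.g. snake_case detection):

```rust
#[derive(Debug, PartialEq)]
enum Route { Keyword, Semantic, Hybrid, Graph }

/// Simplified approximation of the lexical cues listed above.
fn route(query: &str) -> Route {
    let q = query.to_lowercase();
    // Relationship patterns take priority over all other heuristics.
    for pat in ["related to", "opinion on", "connection between", "know about"] {
        if q.contains(pat) {
            return Route::Graph;
        }
    }
    let words: Vec<&str> = q.split_whitespace().collect();
    let question = ["what", "how", "why", "when", "where", "who"]
        .iter()
        .any(|w| words.first() == Some(w));
    if q.contains("::") || q.contains('/') || (words.len() <= 3 && !question) {
        return Route::Keyword; // code patterns or short non-question queries
    }
    if question || words.len() >= 6 {
        return Route::Semantic; // natural-language questions, long queries
    }
    Route::Hybrid // medium-length queries without clear signals
}

fn main() {
    println!("{:?}", route("how does the soft compaction tier work"));
}
```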

Configure via [memory.routing]:

[memory.routing]
strategy = "heuristic"   # Only option currently; selected by default

When Qdrant is unavailable, Semantic-route queries return empty results and Hybrid-route queries fall back to FTS5 only.

Proactive Context Compression

By default, context compression is reactive — it fires only when the two-tier pruning pipeline detects threshold overflow. Proactive compression fires earlier, based on an absolute token count threshold, to prevent overflow altogether.

[memory.compression]
strategy = "proactive"
threshold_tokens = 80000       # Compress when context exceeds this (>= 1000)
max_summary_tokens = 4000      # Cap for the compressed summary (>= 128)

Proactive compression runs at the start of the context management phase, before reactive compaction. If proactive compression fires, reactive compaction is skipped for that turn (mutual exclusion via compacted_this_turn flag, reset each turn).

Metrics: compression_events (count), compression_tokens_saved (cumulative tokens freed).

Failure-Driven Compression Guidelines

Zeph can learn from its own compaction mistakes using the ACON (Adaptive COmpaction with Notes) mechanism. When [memory.compression_guidelines] is enabled:

  1. After each hard compaction event, the agent opens a detection window spanning detection_window_turns turns.
  2. Within that window, every LLM response is scanned for a two-signal pattern: an uncertainty phrase (e.g. “I don’t recall”, “I’m not sure”) and a prior-context reference (e.g. “earlier you mentioned”, “we discussed”). Both signals must appear together — this two-signal requirement reduces false positives.
  3. Confirmed failure pairs (compressed context snapshot + failure reason) are stored in compression_failure_pairs in SQLite.
  4. A background task wakes every update_interval_secs seconds. When the count of unprocessed pairs reaches update_threshold, it calls the LLM with a synthesis prompt that includes the current guidelines and the new failure pairs.
  5. The LLM produces an updated numbered list of preservation rules. The output is sanitized (prompt injection patterns stripped, length bounded by max_guidelines_tokens), then stored atomically using a single INSERT ... SELECT COALESCE(MAX(version), 0) + 1 statement that eliminates TOCTOU version conflicts.
  6. Every subsequent compaction injects the active guidelines inside a <compression-guidelines> block, steering the summarizer to preserve previously-lost information categories.
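The two-signal detector in step 2 can be sketched as a pair of substring scans. Phrase lists here are abbreviated examples from the text; real matching may normalize apostrophes and cover more phrases:

```rust
/// Two-signal detector from step 2: both an uncertainty phrase and a
/// prior-context reference must appear. Phrase lists are abbreviated.
fn looks_like_context_loss(response: &str) -> bool {
    let r = response.to_lowercase();
    let uncertainty = ["i don't recall", "i'm not sure"]
        .iter()
        .any(|p| r.contains(p));
    let prior_ref = ["earlier you mentioned", "we discussed"]
        .iter()
        .any(|p| r.contains(p));
    uncertainty && prior_ref // a single signal alone is treated as a false positive
}

fn main() {
    let reply = "I'm not sure, but earlier you mentioned a staging config.";
    println!("{}", looks_like_context_loss(reply));
}
```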

Configuration:

[memory.compression_guidelines]
enabled = true
update_threshold = 5             # Failure pairs needed to trigger a guidelines update (default: 5)
max_guidelines_tokens = 500      # Token budget for the synthesized guidelines (default: 500)
max_pairs_per_update = 10        # Pairs consumed per update cycle (default: 10)
detection_window_turns = 10      # Turns to watch for context loss after hard compaction (default: 10)
update_interval_secs = 300       # Background updater interval in seconds (default: 300)
max_stored_pairs = 100           # Cleanup threshold for stored failure pairs (default: 100)

The feature is opt-in (enabled = false by default). When disabled, compression prompts are unchanged and no failure pairs are recorded. Guidelines accumulate incrementally across sessions — the agent improves its compression behavior over time.

Two-Tier Reactive Compaction

When context usage crosses predefined thresholds, a two-tier compaction strategy activates. Each tier is cheaper than the next. Tier 0 (eager deferred summaries) runs continuously during tool loops independently of these tiers.

Soft Tier: Apply Deferred Summaries + Prune Tool Outputs (at soft_compaction_threshold)

When context usage exceeds soft_compaction_threshold (default: 0.60), Zeph first batch-applies all pending deferred summaries (in-memory, no LLM call), then prunes tool outputs outside the protected tail. This tier does not prevent the hard tier from firing in the same turn.

The soft tier also fires mid-iteration inside tool execution loops (via maybe_soft_compact_mid_iteration()), after summarization and stale pruning. This prevents large tool outputs from pushing context past the hard threshold within a single LLM turn without touching turn counters or cooldown.

Why lazy application? Tool pair summaries are computed eagerly (right after each tool call) but their application to the message array is deferred. As long as context usage stays below 0.60, the original tool call/response messages remain in the array unchanged. This keeps the message prefix stable across consecutive turns, which is the key requirement for the Claude API prompt cache to produce hits.

Hard Tier: Selective Tool Output Pruning + LLM Compaction (at hard_compaction_threshold)

When context usage exceeds hard_compaction_threshold (default: 0.90), Zeph applies deferred summaries, prunes tool outputs, and — if pruning is insufficient — falls back to full LLM-based chunked compaction. Once hard compaction fires, it sets compacted_this_turn to prevent double LLM summarization.

Zeph scans messages outside the protected tail for ToolOutput parts and replaces their content with a short placeholder. This is a cheap, synchronous operation that often frees enough tokens to stay under the threshold without an LLM call.

  • Only tool outputs in messages older than the protected tail are pruned
  • The most recent prune_protect_tokens tokens (default: 40,000) worth of messages are never pruned, preserving recent tool context
  • Pruned parts have their compacted_at timestamp set, body is cleared from memory to reclaim heap, and they are not pruned again
  • Pruned parts are persisted to SQLite before clearing, so pruning state survives session restarts
  • The tool_output_prunes metric tracks how many parts were pruned

Chunked LLM Compaction (Hard Tier Fallback)

If Tier 1 does not free enough tokens, adaptive chunked compaction runs:

  1. Middle messages (between system prompt and last N recent) are split into ~4096-token chunks
  2. Chunks are summarized in parallel via futures::stream::buffer_unordered(4) — up to 4 concurrent LLM calls
  3. Partial summaries are merged into a final summary via a second LLM pass
  4. replace_conversation() atomically updates the compacted range and inserts the summary in SQLite
  5. Last compaction_preserve_tail messages (default: 4) are always preserved

If all messages fit in a single chunk, or if chunked summarization fails, the system falls back to single-pass summarization over the full message range.

Both tiers are idempotent and run automatically during the agent loop.

Post-Compression Validation (Compaction Probe)

After hard-tier LLM compaction produces a candidate summary, an optional validation step can verify that the summary preserves critical facts before committing it. The compaction probe generates factual questions from the original messages, answers them using only the summary, and scores the answers. The probe runs only during hard-tier compaction events — soft-tier pruning and deferred summaries are not validated.

The feature is disabled by default ([memory.compression.probe] enabled = false). On errors or timeouts, the probe fails open — compaction proceeds without validation.

How It Works

  1. After summarize_messages() produces a summary, the probe generates up to max_questions factual questions from the original messages. Tool output bodies are truncated to 500 characters to focus on decisions and outcomes.
  2. Questions target concrete details: file paths, function/struct names, architectural decisions, config values, error messages, and action items.
  3. A second LLM call answers the questions using ONLY the summary text. If information is absent from the summary, the model answers “UNKNOWN”.
  4. Answers are scored against expected values using token-set-ratio similarity (Jaccard-based with substring boost). Refusal patterns (“unknown”, “not mentioned”, “n/a”, etc.) score 0.0.
  5. The average score determines the verdict.

If the probe generates fewer than 2 questions (e.g., very short conversations with insufficient factual content), the probe is skipped and compaction proceeds without validation.
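The scoring in step 4 can be sketched as token-set Jaccard with a substring boost. The refusal list, the 0.9 boost floor, and the normalization are illustrative assumptions:

```rust
use std::collections::HashSet;

/// Token-set Jaccard with a substring boost and refusal zeroing. Constants
/// and refusal phrases are illustrative assumptions.
fn score_answer(expected: &str, actual: &str) -> f64 {
    let e = expected.to_lowercase();
    let a = actual.to_lowercase();
    for refusal in ["unknown", "not mentioned", "n/a"] {
        if a.trim() == refusal {
            return 0.0; // refusal patterns score 0.0
        }
    }
    let toks = |s: &str| s.split_whitespace().map(str::to_owned).collect::<HashSet<String>>();
    let (te, ta) = (toks(&e), toks(&a));
    let union = te.union(&ta).count();
    let jaccard = if union == 0 {
        0.0
    } else {
        te.intersection(&ta).count() as f64 / union as f64
    };
    // Substring boost: containment of the expected answer scores high.
    if a.contains(&e) || e.contains(&a) { jaccard.max(0.9) } else { jaccard }
}

fn main() {
    println!("{}", score_answer("crates/zeph-core/src/auth.rs",
                                "The file crates/zeph-core/src/auth.rs was modified"));
}
```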

Verdict Behavior

| Verdict | Score range (defaults) | Action | Metric incremented |
|---|---|---|---|
| Pass | >= 0.60 | Commit summary | compaction_probe_passes |
| SoftFail | [0.35, 0.60) | Commit summary + WARN log | compaction_probe_soft_failures |
| HardFail | < 0.35 | Block compaction, preserve original messages | compaction_probe_failures |
| Error | N/A (LLM/timeout) | Non-blocking, proceed with compaction | compaction_probe_errors |

When HardFail blocks compaction, the outcome is ProbeRejected. This sets an internal cooldown but does NOT trigger the Exhausted state — the compactor can retry on a later turn with new messages.

User-Facing Messages

  • During probe: status indicator shows “Validating compaction quality…”
  • HardFail (via /compact): “Compaction rejected: summary quality below threshold. Original context preserved.”
  • SoftFail: warning in logs only; user sees normal “Context compacted successfully.”
  • Pass: normal “Context compacted successfully.”

Configuration

[memory.compression.probe]
enabled = false           # Enable compaction probe validation (default: false)
model = ""                # Model for probe LLM calls (empty = summary provider)
threshold = 0.6           # Minimum score to pass without warnings
hard_fail_threshold = 0.35 # Score below this blocks compaction (HardFail)
max_questions = 3         # Maximum factual questions per probe
timeout_secs = 15         # Timeout for the entire probe (both LLM calls)

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Enable probe validation after each hard compaction |
| model | string | "" | Model override for probe LLM calls. Empty = use summary provider. Non-Haiku models increase cost (~10x) |
| threshold | float | 0.6 | Minimum average score for Pass verdict |
| hard_fail_threshold | float | 0.35 | Score below this triggers HardFail (blocks compaction) |
| max_questions | integer | 3 | Number of factual questions generated per probe |
| timeout_secs | integer | 15 | Timeout for both LLM calls combined |

Threshold tuning:

  • Decrease threshold to 0.45-0.50 for creative or conversational sessions where verbatim detail preservation matters less.
  • Raise threshold to 0.75-0.80 for coding sessions where file paths and architectural decisions must survive compaction.
  • Keep a gap of at least 0.15-0.20 between hard_fail_threshold and threshold to maintain a meaningful SoftFail range.
  • max_questions = 3 balances probe accuracy against latency and cost. Increase to 5 for higher statistical power at the expense of slower probes.

Debug Dump Output

When debug dump is enabled, each probe writes a {id:04}-compaction-probe.json file with the full probe result:

{
  "score": 0.75,
  "threshold": 0.6,
  "hard_fail_threshold": 0.35,
  "verdict": "Pass",
  "model": "claude-haiku-4-5-20251001",
  "duration_ms": 2340,
  "questions": [
    {
      "question": "What file was modified to fix the auth bug?",
      "expected": "crates/zeph-core/src/auth.rs",
      "actual": "The file crates/zeph-core/src/auth.rs was modified",
      "score": 1.0
    }
  ]
}

The questions array merges question text, expected answer, actual LLM answer, and per-question score into a single object per question for easy inspection.

Troubleshooting

Frequent HardFail verdicts

  • The summary model may be too small for the conversation complexity. Try a larger model via model = "claude-sonnet-4-5-20250514" (higher cost).
  • Lower hard_fail_threshold if false negatives are common (probe is too strict).
  • Increase max_questions to 5 for more statistical power (increases latency).

Probe always returns SoftFail

  • Check debug dump: if per-question scores show one strong and one weak answer, the summary may be partially lossy. This is expected behavior — SoftFail means “good enough” and does not block compaction.
  • Consider enabling Failure-Driven Compression Guidelines to teach the summarizer what to preserve.

Probe timeout warnings

  • Default 15s should be sufficient for most models. Increase timeout_secs for slow providers (e.g., local Ollama with large models).
  • On timeout, compaction proceeds without validation (fail-open).

Performance considerations

  • Each probe makes 2 LLM calls (question generation + answer verification).
  • With Haiku: ~$0.001-0.003 per probe, 1-3 seconds latency.
  • With Sonnet: ~$0.01-0.03 per probe, 2-5 seconds latency.
  • Probes run only during hard compaction events, not on every turn.
  • The probe timeout does not affect the main agent loop — it only gates whether the compaction summary is committed.

Metrics

| Metric | Description |
|---|---|
| compaction_probe_passes | Total Pass verdicts |
| compaction_probe_soft_failures | Total SoftFail verdicts |
| compaction_probe_failures | Total HardFail verdicts (compaction blocked) |
| compaction_probe_errors | Total Error verdicts (LLM/timeout, non-blocking) |
| last_probe_verdict | Most recent verdict (Pass/SoftFail/HardFail/Error) |
| last_probe_score | Most recent probe score in [0.0, 1.0] |

Compaction Loop Prevention

maybe_compact() tracks whether compaction is making progress. The compaction_exhausted flag is set permanently when any of the following conditions are detected after a hard-tier attempt:

  • Fewer than 2 messages are eligible for compaction (nothing useful to summarize).
  • The LLM summary consumes as many tokens as were freed — net reduction is zero.
  • Context usage remains above hard_compaction_threshold even after a successful summarization pass.

Once exhausted, all further compaction calls are skipped for the session. A one-time warning is emitted to the user channel and to the log (WARN level):

Warning: context budget is too tight — compaction cannot free enough space.
Consider increasing [memory] context_budget_tokens or starting a new session.

This prevents infinite compaction loops when the configured budget is smaller than the minimum required for the system prompt and response reservation combined.

Structured Anchored Summarization

When hard compaction fires, the summarizer can produce structured AnchoredSummary objects with five mandatory sections:

| Section | Content |
|---|---|
| session_intent | What the user is trying to accomplish |
| files_modified | File paths, function names, structs touched |
| decisions_made | Architectural decisions with rationale |
| open_questions | Unresolved items or ambiguities |
| next_steps | Concrete actions to take immediately |

Anchored summaries are validated for completeness (session_intent and next_steps must be non-empty) and rendered as Markdown with [anchored summary] headers. This structured format reduces information loss compared to the free-form 9-section prompt below.

Subgoal-Aware Compaction

When task orchestration is active, the SubgoalRegistry tracks which messages belong to each subgoal and their state (Active, Completed, Abandoned). During hard compaction:

  • Messages in active subgoal ranges are preserved unconditionally
  • Messages in completed subgoal ranges are aggressively compacted
  • The registry state is dumped alongside each compaction event when debug dump is enabled ({id:04}-subgoal-registry.txt)

This prevents compaction from destroying the context that an in-progress orchestration task depends on.

Structured Compaction Prompt

Compaction summaries use a 9-section structured prompt designed for self-consumption. The LLM is instructed to produce exactly these sections:

  1. User Intent — what the user is ultimately trying to accomplish
  2. Technical Concepts — key technologies, patterns, constraints discussed
  3. Files & Code — file paths, function names, structs, enums touched or relevant
  4. Errors & Fixes — every error encountered and whether/how it was resolved
  5. Problem Solving — approaches tried, decisions made, alternatives rejected
  6. User Messages — verbatim user requests that are still pending or relevant
  7. Pending Tasks — items explicitly promised or left TODO
  8. Current Work — the exact task in progress at the moment of compaction
  9. Next Step — the single most important action to take immediately after compaction

The prompt favors thoroughness over brevity: longer summaries that preserve actionable detail are preferred over terse ones. When multiple chunks are summarized in parallel, a consolidation pass merges partial summaries into the same 9-section structure.

Progressive Tool Response Removal

When the LLM compaction itself hits a context length error (the messages being compacted are too large for the summarization model), summarize_messages() applies progressive middle-out tool response removal before retrying:

| Tier | Fraction removed | Description |
|---|---|---|
| 1 | 10% | Remove ~10% of tool responses from the center outward |
| 2 | 20% | Increase removal to ~20% |
| 3 | 50% | Remove half of all tool responses |
| 4 | 100% | Remove all tool responses |

The middle-out strategy starts removal from the center of the tool response list and alternates outward toward the edges. This preserves the earliest responses (which establish context) and the most recent ones (which reflect current work), while discarding the middle of the conversation first.
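The ordering can be sketched as follows, assuming removal starts at the exact center and alternates left then right (the real tie-breaking may differ):

```rust
/// Indices to remove for one tier, starting at the center and alternating
/// outward. The center choice and left-then-right order are assumptions.
fn middle_out_indices(len: usize, frac: f64) -> Vec<usize> {
    if len == 0 { return Vec::new(); }
    let n = ((len as f64) * frac).ceil() as usize;
    let mid = len / 2;
    let mut order = vec![mid];
    for step in 1..len {
        if mid >= step { order.push(mid - step); }      // step left of center
        if mid + step < len { order.push(mid + step); } // step right of center
    }
    order.truncate(n.min(len));
    order
}

fn main() {
    // Tier 2 (20%) over 10 tool responses removes 2, starting at the center.
    println!("{:?}", middle_out_indices(10, 0.2));
}
```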

At each tier, ToolResult content is replaced with [compacted] and ToolOutput bodies are cleared (with compacted_at timestamp set). The reduced message set is then retried through the LLM summarization pipeline.

Metadata-Only Fallback

If all LLM summarization attempts fail (including after 100% tool response removal), build_metadata_summary() produces a lightweight summary without any LLM call:

[metadata summary — LLM compaction unavailable]
Messages compacted: 47 (23 user, 22 assistant, 2 system)
Last user message: <first 200 chars of last user message>
Last assistant message: <first 200 chars of last assistant message>

Text previews use safe UTF-8 truncation (truncate_chars()) that never splits a Unicode scalar value. This fallback guarantees that compaction always succeeds, even when the LLM is unreachable or the context is too large for any available model.
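Char-boundary-safe truncation can be sketched with char_indices(); the real truncate_chars() may differ in detail:

```rust
/// Keep at most `max_chars` Unicode scalar values, never splitting one.
/// Sketch of the behavior described above, not Zeph's exact implementation.
fn truncate_chars(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx], // cut on a char boundary
        None => s,                             // already short enough
    }
}

fn main() {
    println!("{}", truncate_chars("héllo wörld", 5));
}
```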

Reactive Retry on Context Length Errors

LLM calls in the agent loop (call_llm_with_retry() and call_chat_with_tools_retry()) intercept context length errors and automatically compact before retrying. The flow:

  1. Send messages to the LLM provider
  2. If the provider returns a context length error, trigger compact_context()
  3. Retry the LLM call with the compacted context
  4. If the error persists after max_attempts (default: 2), propagate the error

Non-context-length errors (rate limits, network failures, etc.) are propagated immediately without retry.

Context Length Error Detection

LlmError::is_context_length_error() detects context overflow across providers via pattern matching on error messages:

| Provider | Matched patterns |
|---|---|
| Claude | "maximum number of tokens" |
| OpenAI | "maximum context length", "context_length_exceeded" |
| Ollama | "context length exceeded", "prompt is too long", "input too long" |

The dedicated LlmError::ContextLengthExceeded variant is also recognized. This unified detection allows the retry logic to work identically across all supported LLM backends.
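The detection can be sketched as a case-insensitive scan over the patterns in the table (the dedicated error variant is omitted here):

```rust
/// Cross-provider context-length detection over the documented patterns.
/// Sketch only; the real check also matches LlmError::ContextLengthExceeded.
fn is_context_length_error(msg: &str) -> bool {
    const PATTERNS: &[&str] = &[
        "maximum number of tokens",  // Claude
        "maximum context length",    // OpenAI
        "context_length_exceeded",   // OpenAI
        "context length exceeded",   // Ollama
        "prompt is too long",        // Ollama
        "input too long",            // Ollama
    ];
    let m = msg.to_lowercase();
    PATTERNS.iter().any(|p| m.contains(p))
}

fn main() {
    println!("{}", is_context_length_error("prompt is too long for this model"));
}
```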

Dual-Visibility Compaction

Compaction is non-destructive. Each Message carries MessageMetadata with agent_visible and user_visible flags:

| Message state | agent_visible | user_visible | Appears in |
|---|---|---|---|
| Normal | true | true | LLM context + UI |
| Compacted original | false | true | UI only |
| Compaction summary | true | false | LLM context only |

replace_conversation() performs both updates atomically in a single SQLite transaction: it sets agent_visible=0, compacted_at=<timestamp> on the compacted range, then inserts the summary with agent_visible=1, user_visible=0. This guarantees the user retains full scroll-back history while the LLM sees only the compact summary.

Semantic recall (vector + FTS5) filters by agent_visible=1, so compacted originals are excluded from retrieval. Use load_history_filtered(conversation_id, agent_visible, user_visible) to query messages by visibility.

Native compress_context Tool

When the context-compression feature is enabled, Zeph registers a compress_context native tool that the model can invoke explicitly to trigger context compression on demand — without waiting for the automatic threshold-based compaction pipeline to fire.

The tool supports two compression strategies:

| Strategy | Behavior |
|---|---|
| Reactive | Apply pending deferred summaries and prune old tool outputs (no LLM call). Equivalent to a soft-tier compaction triggered on demand. |
| Autonomous | Run full LLM-based chunked compaction immediately, regardless of current token usage. The model decides when to invoke this based on its own assessment of context quality. |

Autonomous mode uses the compress_provider for the summarization call. Configure it in [memory.compression]:

[memory.compression]
compress_provider = "fast"   # Provider name for autonomous compress_context calls

When compress_provider is unset, the default LLM provider is used. The compress_context tool does not appear in the tool catalog when the context-compression feature is disabled at build time.

Invocation:

The model calls the tool with a strategy parameter:

{ "strategy": "Autonomous" }

After execution, the tool returns a summary of tokens freed and the compaction outcome. The result is visible in the chat panel and in the debug dump.

Tool Output Management

Truncation

Tool outputs exceeding 30,000 characters are automatically truncated using a head+tail split with UTF-8 safe boundaries. Both the first and last ~15K chars are preserved.
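The head+tail split can be sketched as below; the marker text is illustrative, and working on chars keeps the cuts on UTF-8 safe boundaries as described:

```rust
/// Keep ~half the limit from each end of an over-long tool output.
/// Sketch of the behavior described above; marker text is illustrative.
fn truncate_head_tail(s: &str, max_chars: usize) -> String {
    let chars: Vec<char> = s.chars().collect();
    if chars.len() <= max_chars {
        return s.to_string(); // already within the limit
    }
    let half = max_chars / 2;
    let head: String = chars[..half].iter().collect();
    let tail: String = chars[chars.len() - half..].iter().collect();
    format!("{head}\n[... truncated ...]\n{tail}")
}

fn main() {
    let big = "a".repeat(40_000);
    println!("{} chars kept", truncate_head_tail(&big, 30_000).chars().count());
}
```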

Smart Summarization

When tools.summarize_output = true, long tool outputs are sent through the LLM with a prompt that preserves file paths, error messages, and numeric values. If the LLM call fails, summarization falls back to truncation.

export ZEPH_TOOLS_SUMMARIZE_OUTPUT=true

Skill Prompt Modes

The skills.prompt_mode setting controls how matched skills are rendered in the system prompt:

| Mode | Behavior |
|---|---|
| full | Full XML skill bodies with instructions, examples, and references |
| compact | Condensed XML with name, description, and trigger list only (~80% smaller) |
| auto (default) | Selects compact when the remaining context budget is below 8192 tokens; full otherwise |

[skills]
prompt_mode = "auto"  # "full", "compact", or "auto"

compact mode is useful for small context windows or when many skills are active. It preserves enough information for the model to select the right skill while minimizing token consumption.

Progressive Skill Loading

Skills matched by embedding similarity (top-K) are injected with their full body (or compact summary, depending on prompt_mode). Remaining skills are listed in a description-only <other_skills> catalog — giving the model awareness of all capabilities while consuming minimal tokens.

ZEPH.md Project Config

Zeph walks up the directory tree from the current working directory looking for:

  • ZEPH.md
  • ZEPH.local.md
  • .zeph/config.md

Found configs are concatenated (global first, then ancestors from root to cwd) and injected into the system prompt as a <project_context> block. Use this to provide project-specific instructions.
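The walk-up discovery can be sketched with Path::ancestors(); the global config location is omitted here, and the ordering mirrors the root-to-cwd concatenation described above:

```rust
use std::path::{Path, PathBuf};

/// Candidate config paths from the filesystem root down to cwd, using the
/// documented file names. Sketch only; global config handling is omitted.
fn config_candidates(cwd: &Path) -> Vec<PathBuf> {
    let names = ["ZEPH.md", "ZEPH.local.md", ".zeph/config.md"];
    let mut dirs: Vec<&Path> = cwd.ancestors().collect(); // cwd, parent, ..., root
    dirs.reverse(); // concatenation order is root first, then down to cwd
    dirs.iter()
        .flat_map(|d| names.iter().map(move |n| d.join(n)))
        .collect()
}

fn main() {
    for p in config_candidates(Path::new("/repo/crates/zeph-core")) {
        println!("{}", p.display());
    }
}
```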

Environment Variables

| Variable | Description | Default |
|---|---|---|
| ZEPH_MEMORY_CONTEXT_BUDGET_TOKENS | Context budget in tokens | 0 (unlimited) |
| ZEPH_MEMORY_SOFT_COMPACTION_THRESHOLD | Soft compaction threshold: prune tool outputs + apply deferred summaries (no LLM) | 0.60 |
| ZEPH_MEMORY_COMPACTION_THRESHOLD | Hard compaction threshold (backward-compat alias for hard_compaction_threshold) | 0.90 |
| ZEPH_MEMORY_COMPACTION_PRESERVE_TAIL | Messages preserved during compaction | 4 |
| ZEPH_MEMORY_PRUNE_PROTECT_TOKENS | Tokens protected from Tier 1 tool output pruning | 40000 |
| ZEPH_MEMORY_CROSS_SESSION_SCORE_THRESHOLD | Minimum relevance score for cross-session memory results | 0.35 |
| ZEPH_MEMORY_TOOL_CALL_CUTOFF | Max visible tool pairs before oldest is summarized | 6 |
| ZEPH_MEMORY_SEMANTIC_TEMPORAL_DECAY_ENABLED | Enable temporal decay scoring | false |
| ZEPH_MEMORY_SEMANTIC_TEMPORAL_DECAY_HALF_LIFE_DAYS | Half-life for temporal decay | 30 |
| ZEPH_MEMORY_SEMANTIC_MMR_ENABLED | Enable MMR re-ranking | false |
| ZEPH_MEMORY_SEMANTIC_MMR_LAMBDA | MMR relevance-diversity trade-off | 0.7 |
| ZEPH_TOOLS_SUMMARIZE_OUTPUT | Enable LLM-based tool output summarization | false |