feat(debug): structure debug dumps as OpenTelemetry-compatible traces#1797

Merged
bug-ops merged 4 commits into main from structure-debug-dumps-as-opent on Mar 14, 2026
Conversation

@bug-ops bug-ops commented Mar 14, 2026

Summary

Implements #1343 — extends debug dumps to emit structured OpenTelemetry-compatible OTLP JSON traces.

  • Span hierarchy: session → iteration → LLM request / tool call / memory search / skill match with full parent-child relationship tracking
  • New DumpFormat::Trace: when --dump-format trace is set, legacy numbered files are NOT written; when otel feature is enabled, spans are forwarded via mpsc channel to the OTLP subscriber
  • [debug.traces] config section: otlp_endpoint, service_name (default: "zeph"), redact (default: true), max_spans (default: 10000)
  • Security: trace.json written with 0o600 permissions on Unix; all text attributes pass through existing Redactor; max_spans cap prevents memory exhaustion in long sessions; error_kind in tool spans is redacted
  • Runtime switch: /dump-format <json|raw|trace> TUI/CLI command creates/destroys TracingCollector on the fly
  • Integration points: --dump-format CLI flag, --init wizard step, --migrate-config auto-migration for existing [debug] configs
  • OTLP JSON output written to {dump_dir}/{session_id}/trace.json (session-isolated, no concurrent overwrites)
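
The [debug.traces] defaults listed above can be sketched as a plain Rust struct. This is only an illustration: the struct name `TraceConfig`, the field types, and the choice of `None` as the default for `otlp_endpoint` are assumptions, since the PR states defaults only for service_name, redact, and max_spans.

```rust
/// Hypothetical mirror of the `[debug.traces]` config section.
/// The struct name and field types are assumptions, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceConfig {
    /// OTLP endpoint; assumed optional because no default is stated.
    pub otlp_endpoint: Option<String>,
    /// Service name reported in traces (default: "zeph").
    pub service_name: String,
    /// Whether text attributes are redacted (default: true).
    pub redact: bool,
    /// Upper bound on buffered spans (default: 10000).
    pub max_spans: usize,
}

impl Default for TraceConfig {
    fn default() -> Self {
        Self {
            otlp_endpoint: None, // assumption
            service_name: "zeph".to_string(),
            redact: true,
            max_spans: 10_000,
        }
    }
}
```

Keeping the defaults in a `Default` impl means an existing [debug] config with no traces section would still resolve to redaction on and a bounded span buffer.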

Test plan

  • cargo nextest run --workspace --features full --lib --bins — 5684 tests pass
  • cargo clippy --workspace --features full -- -D warnings — clean
  • cargo +nightly fmt --check — clean
  • Verify --dump-format trace produces valid OTLP JSON with session → iteration → tool/llm span hierarchy
  • Verify --dump-format raw still produces legacy numbered files (backward compat)
  • Verify trace.json has 0o600 permissions on Unix
  • File follow-up issue for OBS-02 (the native tool span's start timestamp is recorded after execution because the span is assembled post hoc)

bug-ops added 3 commits March 14, 2026 23:30
Add DumpFormat::Trace variant that emits OTLP-compatible JSON instead of
numbered dump files. TracingCollector captures session/iteration/LLM/tool/
memory spans with full hierarchy, redacts secrets in all text attributes
(C-01), uses owned SpanGuard for async-safe begin/end calls (C-02), turns
legacy DebugDumper writes into no-ops when Trace format is active (C-03), and
flushes partial traces on Drop for error/panic/cancellation paths (C-04). An optional
mpsc channel forwards completed spans to the otel feature's OTLP exporter
(C-05). Concurrent iterations are tracked via HashMap<usize, IterationEntry>
(I-03). OTLP JSON encoding follows the Protobuf JSON spec with string int64
timestamps (I-04).
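
To illustrate the encoding rule from I-04, here is a minimal std-only sketch of serializing one span in OTLP JSON style: lowerCamelCase keys, hex-encoded trace and span IDs, and 64-bit timestamps written as decimal strings. The `Span` struct and the helper names are hypothetical and cover only a small subset of the OTLP span fields; the PR's actual encoder is not shown in this conversation.

```rust
/// Illustrative span record; the field set is a small subset of OTLP.
pub struct Span {
    pub trace_id: [u8; 16],
    pub span_id: [u8; 8],
    pub parent_span_id: Option<[u8; 8]>,
    pub name: String,
    pub start_unix_nano: u64,
    pub end_unix_nano: u64,
}

/// Lowercase hex encoding, as OTLP JSON expects for trace and span IDs.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Escape characters that may not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize one span; timestamps are emitted as quoted decimal strings.
pub fn span_to_otlp_json(span: &Span) -> String {
    // A root span has no parent; an empty string is used here for brevity.
    let parent = span.parent_span_id.map(|p| hex(&p)).unwrap_or_default();
    format!(
        "{{\"traceId\":\"{}\",\"spanId\":\"{}\",\"parentSpanId\":\"{}\",\"name\":\"{}\",\"startTimeUnixNano\":\"{}\",\"endTimeUnixNano\":\"{}\"}}",
        hex(&span.trace_id),
        hex(&span.span_id),
        parent,
        json_escape(&span.name),
        span.start_unix_nano,
        span.end_unix_nano,
    )
}
```

Quoting the timestamps avoids precision loss in JSON consumers that parse numbers as 64-bit floats, which cannot represent every nanosecond value exactly.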

Integration points: --dump-format CLI flag, /dump-format slash command, TUI
command dispatch, --init wizard step, config [debug.traces] section with
otlp_endpoint/service_name/redact fields, config default.toml update.

CR-01: wire begin/end_llm_request and begin/end_tool_call at actual call
sites. Store current_iteration_span_id in DebugState so both legacy and
native execution paths can attach child spans without parameter threading.
Introduce execute_tool_with_trace helper in legacy.rs to stay within the
100-line function limit.
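
A rough sketch of the CR-01 idea: the current iteration's span ID lives in shared debug state, so call sites can open child spans without receiving a parent argument. The types below are simplified stand-ins; only the names DebugState and current_iteration_span_id come from the commit message, and the `u64` IDs, `SpanRecord`, and method signatures are assumptions.

```rust
/// Simplified span record (hypothetical; real spans carry more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct SpanRecord {
    pub span_id: u64,
    pub parent_span_id: Option<u64>,
    pub name: String,
}

/// Stand-in for the debug state described in CR-01.
#[derive(Default)]
pub struct DebugState {
    next_id: u64,
    /// Set when an iteration begins; read by LLM and tool call sites.
    pub current_iteration_span_id: Option<u64>,
    pub spans: Vec<SpanRecord>,
}

impl DebugState {
    fn new_span(&mut self, name: &str, parent: Option<u64>) -> u64 {
        self.next_id += 1;
        self.spans.push(SpanRecord {
            span_id: self.next_id,
            parent_span_id: parent,
            name: name.to_string(),
        });
        self.next_id
    }

    pub fn begin_session(&mut self) -> u64 {
        self.new_span("session", None)
    }

    pub fn begin_iteration(&mut self, session_span_id: u64) -> u64 {
        let id = self.new_span("iteration", Some(session_span_id));
        self.current_iteration_span_id = Some(id);
        id
    }

    /// No parent parameter: the parent is looked up from shared state.
    pub fn begin_tool_call(&mut self, tool: &str) -> u64 {
        let parent = self.current_iteration_span_id;
        self.new_span(tool, parent)
    }
}
```

The trade-off is implicit state: a child span opened outside any iteration gets no parent, so a real implementation would need to decide how to handle that case.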

CR-02: replace std::fs::write with write_trace_file helper using
OpenOptions + mode(0o600) on Unix (SEC-01).
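
A minimal sketch of what a helper like write_trace_file could look like, using only the standard library. The signature is an assumption; the commit states only that it uses OpenOptions with mode(0o600) on Unix.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `path`, creating the file as owner read/write only
/// on Unix. The signature is hypothetical.
pub fn write_trace_file(path: &Path, contents: &[u8]) -> io::Result<()> {
    let mut opts = OpenOptions::new();
    opts.write(true).create(true).truncate(true);

    #[cfg(unix)]
    {
        use std::os::unix::fs::OpenOptionsExt;
        // The mode is applied only when the file is newly created, and the
        // process umask can clear bits from it.
        opts.mode(0o600);
    }

    let mut file = opts.open(path)?;
    file.write_all(contents)
}
```

Unlike std::fs::write followed by a separate permissions change, setting the mode at creation leaves no window in which the file exists with broader permissions. A file that already exists keeps its current mode, so this sketch relies on the per-session path being fresh.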

CR-03: add max_spans field (default 10000) and push_span() helper that
drops the oldest span when the cap is reached (SEC-02).
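
The drop-oldest cap described in CR-03 might look roughly like this. `SpanBuffer` and its String payload are hypothetical stand-ins, and the choice of VecDeque is an assumption made so that evicting the oldest entry is O(1).

```rust
use std::collections::VecDeque;

/// Bounded span buffer that evicts the oldest entry when full.
/// Hypothetical stand-in; spans are plain strings here for brevity.
pub struct SpanBuffer {
    spans: VecDeque<String>,
    max_spans: usize,
}

impl SpanBuffer {
    pub fn new(max_spans: usize) -> Self {
        Self { spans: VecDeque::new(), max_spans }
    }

    /// Push a span, dropping the oldest one(s) if the cap is reached.
    pub fn push_span(&mut self, span: String) {
        if self.max_spans == 0 {
            return; // a zero cap stores nothing
        }
        while self.spans.len() >= self.max_spans {
            self.spans.pop_front();
        }
        self.spans.push_back(span);
    }

    pub fn len(&self) -> usize {
        self.spans.len()
    }

    pub fn oldest(&self) -> Option<&String> {
        self.spans.front()
    }
}
```

Dropping the oldest span keeps the most recent activity, which is usually what matters when inspecting the tail of a long session, at the cost of possibly losing the parents of spans that remain.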

CR-04: handle_dump_format_command now creates a fresh TracingCollector
when switching TO trace format, and flushes/drops it when switching AWAY.
Store dump_dir/trace_service_name/trace_redact in DebugState via
with_trace_config builder. Wire with_trace_config in runner.rs.
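
The create-on-enter, flush-on-leave behavior in CR-04 can be sketched with an Option holding the collector and a Drop impl that flushes. All types here are simplified stand-ins: the in-memory `sink` replaces the real trace.json output, and the variant names Json and Raw are assumed from the json|raw|trace command syntax.

```rust
use std::sync::{Arc, Mutex};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DumpFormat {
    Json,
    Raw,
    Trace,
}

/// Stand-in collector: buffers span names and flushes them to a shared sink.
pub struct TracingCollector {
    spans: Vec<String>,
    sink: Arc<Mutex<Vec<String>>>, // stand-in for trace.json
}

impl TracingCollector {
    pub fn new(sink: Arc<Mutex<Vec<String>>>) -> Self {
        Self { spans: Vec::new(), sink }
    }

    pub fn record(&mut self, name: &str) {
        self.spans.push(name.to_string());
    }

    fn flush(&mut self) {
        // Ignore a poisoned lock rather than panicking inside drop.
        if let Ok(mut out) = self.sink.lock() {
            out.extend(self.spans.drain(..));
        }
    }
}

impl Drop for TracingCollector {
    /// Flush whatever was collected, including on early-exit paths.
    fn drop(&mut self) {
        self.flush();
    }
}

pub struct DebugState {
    pub format: DumpFormat,
    pub collector: Option<TracingCollector>,
    sink: Arc<Mutex<Vec<String>>>,
}

impl DebugState {
    pub fn new(sink: Arc<Mutex<Vec<String>>>) -> Self {
        Self { format: DumpFormat::Json, collector: None, sink }
    }

    pub fn handle_dump_format_command(&mut self, new_format: DumpFormat) {
        match (self.format, new_format) {
            // Switching TO trace: start a fresh collector.
            (old, DumpFormat::Trace) if old != DumpFormat::Trace => {
                self.collector = Some(TracingCollector::new(Arc::clone(&self.sink)));
            }
            // Switching AWAY from trace: dropping the collector flushes it.
            (DumpFormat::Trace, new) if new != DumpFormat::Trace => {
                self.collector = None;
            }
            _ => {}
        }
        self.format = new_format;
    }
}
```

Tying the flush to Drop means the same code path covers both an explicit format switch and the error, panic, and cancellation cases mentioned in C-04.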

IMP-01: TracingCollector::new already writes to its own output_dir which
is a timestamped subdir created by DebugDumper::new — no change needed.

IMP-04: apply maybe_redact() to error_kind in end_tool_call.

CR-05 / test gaps: add tool_call_span_emitted, tool_call_error_span_emitted,
session_to_iteration_parent_span_id tests to trace.rs; add
dump_format_from_str_valid and dump_format_from_str_invalid_returns_error
tests to mod.rs.
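
For context on the dump_format_from_str tests, parsing of the format name could be sketched with a FromStr impl as below. Only DumpFormat::Trace is named in the PR; the Json and Raw variant names, the String error type, and case-sensitive matching are assumptions.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DumpFormat {
    Json,  // assumed variant name
    Raw,   // assumed variant name
    Trace, // named in the PR
}

impl FromStr for DumpFormat {
    type Err = String; // assumption; the real error type is not shown

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "json" => Ok(Self::Json),
            "raw" => Ok(Self::Raw),
            "trace" => Ok(Self::Trace),
            other => Err(format!(
                "unknown dump format `{other}`; expected json, raw, or trace"
            )),
        }
    }
}
```

With this in place, both the --dump-format flag and the /dump-format command could share one parser via `value.parse::<DumpFormat>()`.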
@github-actions github-actions bot added the documentation, rust, core, config, and enhancement labels Mar 14, 2026
@bug-ops bug-ops enabled auto-merge (squash) March 14, 2026 23:29
@github-actions github-actions bot added the size/XL Extra large PR (500+ lines) label Mar 14, 2026
call_chat_with_tools exceeded the 100-line limit (101 lines) after the
OTel trace instrumentation was added. Extract the debug dump response
writing block into a dedicated helper method to bring it under the limit.
@bug-ops bug-ops merged commit cfa9e79 into main Mar 14, 2026
15 checks passed
@bug-ops bug-ops deleted the structure-debug-dumps-as-opent branch March 14, 2026 23:44
Labels

config (Configuration file changes), core (zeph-core crate), documentation (Improvements or additions to documentation), enhancement (New feature or request), rust (Rust code changes), size/XL (Extra large PR, 500+ lines)
