
feat: Trust-system Arc 4 (first wave) — rocky trace + feature-gated OTLP metrics#173

Merged
hugocorreia90 merged 1 commit into main from feat/arc-4-observability
Apr 20, 2026

@hugocorreia90
Contributor

Summary

First wave of Arc 4 from the trust-system direction. Lead primitive is user-facing (rocky trace); OTel metrics export is the SRE-side complement.

  • rocky trace <run_id|latest> [--model <name>] renders a completed run as a Gantt-style timeline from the state store's RunRecord. Per-model entries carry start_offset_ms relative to run start plus a greedy first-fit lane index so concurrent materializations display on separate rows. JSON output is a new TraceOutput schema; human output is an ASCII timeline with [####....] duration bars. Sibling to rocky replay — same inputs, different lens: replay is the reproducibility artefact (SQL hashes, row counts, config hash); trace is the observability artefact (offsets, lanes, durations).
  • Feature-gated otel OTLP metrics export. rocky-cli and rocky grow an otel feature that cascades to the existing (but previously unwired) rocky_observe::otel::OtelExporter. New OtelGuard RAII wrapper auto-initialises when OTEL_EXPORTER_OTLP_ENDPOINT is set and flushes + shuts down on drop — covers every exit path (happy, interrupted, error) without explicit cleanup. Off by default; default builds don't pull the OTLP dependency graph.
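The lane assignment and bar rendering described for rocky trace can be sketched as follows. This is a minimal illustration, not the code from this PR: the `TraceEntry` struct, `assign_lanes`, and `render_bar` are hypothetical names, and the real `TraceOutput` schema and bar layout may differ. It assumes "greedy first-fit" means each model, taken in start order, goes into the lowest-numbered lane that is already free at its start offset.

```rust
/// Hypothetical per-model timing record. Field names mirror the PR text
/// (`start_offset_ms`, `lane`); the real `TraceOutput` schema may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct TraceEntry {
    pub model: String,
    pub start_offset_ms: u64,
    pub duration_ms: u64,
    pub lane: usize,
}

/// Greedy first-fit: each run takes the lowest-numbered lane that is free
/// at its start offset. Input tuples are (model, start_offset_ms, duration_ms).
pub fn assign_lanes(mut runs: Vec<(String, u64, u64)>) -> Vec<TraceEntry> {
    // Process in start order so "first fit" is well defined.
    runs.sort_by_key(|(_, start, _)| *start);
    // lane_free_at[i] is the offset at which lane i becomes free again.
    let mut lane_free_at: Vec<u64> = Vec::new();
    let mut out = Vec::with_capacity(runs.len());
    for (model, start, duration) in runs {
        let end = start + duration;
        let lane = match lane_free_at.iter().position(|&free| free <= start) {
            Some(i) => {
                lane_free_at[i] = end;
                i
            }
            None => {
                // Every existing lane is busy: open a new one.
                lane_free_at.push(end);
                lane_free_at.len() - 1
            }
        };
        out.push(TraceEntry {
            model,
            start_offset_ms: start,
            duration_ms: duration,
            lane,
        });
    }
    out
}

/// Render a `[####....]`-style bar of `width` cells for one entry:
/// '#' while the model runs, '.' otherwise (an assumed layout).
pub fn render_bar(entry: &TraceEntry, total_ms: u64, width: usize) -> String {
    let total = total_ms.max(1) as u128;
    let w = width as u128;
    let start_col = ((entry.start_offset_ms as u128 * w) / total).min(w) as usize;
    let end_ms = (entry.start_offset_ms + entry.duration_ms) as u128;
    // Round the end up so short runs still show at least one cell.
    let end_col = ((end_ms * w + total - 1) / total).min(w) as usize;
    let end_col = end_col.max((start_col + 1).min(width));
    let mut bar = String::with_capacity(width + 2);
    bar.push('[');
    for col in 0..width {
        bar.push(if col >= start_col && col < end_col { '#' } else { '.' });
    }
    bar.push(']');
    bar
}
```

With three runs `a` (0 to 100 ms), `b` (50 to 150 ms), and `c` (100 to 150 ms), `a` and `c` share lane 0 while the overlapping `b` lands in lane 1.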

Deferred to later Arc 4 waves

  • OTel span coverage (wrap HookEvent::Before/AfterMaterialize sites with a tracer provider so every model execution becomes a trace span).
  • Freshness SLO enforcement: a [freshness.sla] block in rocky.toml, plus an sla_breach PipelineEvent and HookEvent::SlaBreach.
  • Dagster UI timeline hook (consume TraceOutput on the dagster-rocky side to render per-asset Gantt).

Test plan

  • cargo test -p rocky-cli commands::trace — 7 new tests covering lane assignment (sequential + overlapping), "latest" resolution, empty store, missing run, row rendering, truncation.
  • cargo test --workspace — 976 rocky-core + 168 rocky-cli + adapter suites green. Default build (no otel feature).
  • cargo build -p rocky --features otel — verifies the cross-crate feature cascade builds clean. otel path compiles but is not exercised in CI; operators opt in by setting OTEL_EXPORTER_OTLP_ENDPOINT at runtime.
  • cargo clippy --workspace --all-targets -- -D warnings clean.
  • cargo fmt --check clean.
  • just codegen — new trace.schema.json + Pydantic + TypeScript regenerated.
  • just regen-fixtures — RunOutput unchanged, fixture corpus stable.
  • uv run pytest in integrations/dagster/ — 307 tests pass.

🤖 Generated with Claude Code

…TLP metrics export

- `rocky trace <run_id|latest> [--model <name>]` — new CLI command
  that renders a completed run as a Gantt-style timeline from the
  state store's `RunRecord`. Per-model entries carry
  `start_offset_ms` relative to run start plus a greedy first-fit
  `lane` for concurrent models. `TraceOutput` ships with Pydantic +
  TypeScript bindings so Dagster and custom dashboards can draw the
  timeline without re-deriving the base timestamp. Sibling to
  `rocky replay`: replay is the reproducibility artefact (SQL hashes,
  row counts); trace is the observability artefact (offsets, lanes,
  durations).
- Feature-gated `otel` OTLP metrics export. `rocky-cli` and `rocky`
  grow an `otel` feature that cascades to the existing (but
  previously unwired) `rocky_observe::otel::OtelExporter`. New
  `OtelGuard` RAII wrapper auto-initialises when
  `OTEL_EXPORTER_OTLP_ENDPOINT` is set and flushes + shuts down on
  drop, so every `rocky run` exit path (happy, interrupted, error)
  flushes to the collector without an explicit cleanup call. Off by
  default: default builds omit the OTel dependency graph entirely.
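
The RAII pattern behind `OtelGuard` can be sketched roughly as below. This is an illustration under stated assumptions, not the implementation from this PR: the `Exporter` trait is a hypothetical stand-in for `rocky_observe::otel::OtelExporter` (whose methods are not shown here), `init_with`, `from_env`, and `is_active` are invented names, and treating an empty endpoint as unset is an assumption. The `otel` feature gate is left out so the sketch compiles on its own.

```rust
use std::env;

/// Hypothetical stand-in for `rocky_observe::otel::OtelExporter`;
/// the real type's methods are not shown in this PR.
pub trait Exporter {
    fn flush(&mut self);
    fn shutdown(&mut self);
}

/// RAII wrapper: holds an exporter only when an endpoint was configured.
pub struct OtelGuard<E: Exporter> {
    exporter: Option<E>,
}

impl<E: Exporter> OtelGuard<E> {
    /// Build from an explicit endpoint value; `None` yields a no-op guard.
    pub fn init_with(endpoint: Option<String>, make: impl FnOnce(&str) -> E) -> Self {
        let exporter = endpoint
            .filter(|e| !e.trim().is_empty()) // assumption: empty means unset
            .map(|e| make(&e));
        OtelGuard { exporter }
    }

    /// Read OTEL_EXPORTER_OTLP_ENDPOINT from the process environment.
    pub fn from_env(make: impl FnOnce(&str) -> E) -> Self {
        Self::init_with(env::var("OTEL_EXPORTER_OTLP_ENDPOINT").ok(), make)
    }

    /// True when an exporter was created and will be flushed on drop.
    pub fn is_active(&self) -> bool {
        self.exporter.is_some()
    }
}

impl<E: Exporter> Drop for OtelGuard<E> {
    fn drop(&mut self) {
        // Runs on normal return, on `?` early return, and during panic unwinding.
        if let Some(mut exporter) = self.exporter.take() {
            exporter.flush();
            exporter.shutdown();
        }
    }
}
```

A caller would bind the guard near the top of the command (for example `let _guard = OtelGuard::from_env(make_exporter);`, where `make_exporter` is a hypothetical constructor) and let scope exit do the cleanup. Rust only runs `Drop` when the stack unwinds or the scope returns, so a direct `std::process::exit` call would skip it; how the real guard handles interrupts is not shown in this PR.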

OTel span coverage, freshness SLO enforcement, and the Dagster UI
timeline hook are deferred to later Arc 4 waves.
@hugocorreia90 hugocorreia90 merged commit a6fb919 into main Apr 20, 2026
8 checks passed
@hugocorreia90 hugocorreia90 deleted the feat/arc-4-observability branch April 20, 2026 10:55
hugocorreia90 added a commit that referenced this pull request Apr 20, 2026
Closes the first wave of every trust-system arc (Arcs 1-7) plus the two
wave-2 follow-ups that landed the same day. Nine feature PRs since v1.10.0.

- Arc 1 (#170): rocky lineage --downstream, rocky branch, rocky run --branch, rocky replay
- Arc 2 (#171): per-run cost attribution, [budget] block, budget_breach hook
- Arc 3 (#172): three-state CircuitBreaker, adapter consolidation
- Arc 4 (#173): rocky trace Gantt + feature-gated OTLP metrics export
- Arc 5 (#174): schema-grounded rocky ai prompt + project-aware validator
- Arc 6 wave 1 (#184): --target-dialect P001 portability lint (12 constructs)
- Arc 7 wave 1 (#185): blast-radius P002 SELECT * lint (semantic-graph aware)
- Arc 6 wave 2 (#186): [portability] config block + per-model rocky-allow pragma
- Arc 7 wave 2 (#187): --with-seed source-schema inference

Plus #169 fix: install scripts pick latest engine version by semver.

Version bump: 20 Cargo.toml files (all workspace members except
rocky-bigquery, which tracks its own version).

Wave 2/3 work for every arc remains in the deferred backlog — see
the changelog Deferred section for the full carry-forward.