feat: add IncidentWindow foundation for shared incident time #951

Merged
hamzzaaamalik merged 2 commits into Tracer-Cloud:main from hamzzaaamalik:incident-window-foundation
Apr 27, 2026

@hamzzaaamalik
Collaborator

What was needed:
Time-aware tools each independently default to "last 60 minutes"
counted from the agent's wall clock, not from when the alert
actually started. Slow-burn incidents (alert fires 3h after the
underlying problem began) get queried in the wrong time window
entirely. Different tools also disagree on what window they're
asking about. There is no shared "incident time" anywhere in
AgentState today.

What this PR does:
Pure foundation. No tool behavior changes yet.

  • New app/incident_window.py with:

    • IncidentWindow frozen dataclass with __post_init__ validation
      (UTC normalisation, since < until, 0 <= confidence <= 1,
      rejects naive datetimes, rejects empty source).
    • to_dict / from_dict serialisation with _schema_version=1.
    • Five anchor parsers for Alertmanager, Grafana, PagerDuty (v3
      and v2 shapes), Datadog (epoch ms / s / ISO), CloudWatch
      (top-level and SNS-wrapped Message, depth-capped at 4 levels
      to prevent stack overflow on pathologically nested payloads).
      Each parser is wrapped in try/except so a misbehaving parser
      cannot crash the pipeline.
    • resolve_incident_window(raw_alert, *, override, lookback_minutes,
      forward_buffer_minutes, now) with override-always-wins
      precedence, anchor lookup, default fallback, clock-skew
      protection, lookback clamped to MAX_LOOKBACK_MINUTES (7d),
      defensive handling of zero/negative lookback, injectable now
      for deterministic tests.
    • Structured INFO log on each anchored resolution and DEBUG log
      on each default fallback for production debuggability.
  • AgentState (TypedDict) and AgentStateModel (Pydantic) both gain
    incident_window: dict | None. Drift test still passes.

  • extract_alert/extract_node.py now calls resolve_incident_window
    on the enriched raw alert and stores the result via to_dict() in
    state.incident_window. Existing extract_alert behavior unchanged
    (only adds a new key to the result dict).
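The validation and serialisation rules listed above could be sketched roughly as follows. This is a minimal illustration, not the code in app/incident_window.py: the class name IncidentWindowSketch and every dict key other than _schema_version are assumptions, and the real module may order or word its checks differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

SCHEMA_VERSION = 1  # mirrors the _schema_version=1 mentioned in the description


@dataclass(frozen=True)
class IncidentWindowSketch:
    """Hypothetical stand-in for IncidentWindow; field names follow the PR text."""

    since: datetime
    until: datetime
    source: str
    confidence: float

    def __post_init__(self) -> None:
        for name in ("since", "until"):
            value = getattr(self, name)
            if not isinstance(value, datetime):
                raise TypeError(f"{name} must be a datetime")
            if value.tzinfo is None or value.utcoffset() is None:
                raise ValueError(f"{name} must be timezone-aware")
            # Frozen dataclasses block plain assignment, so normalise to UTC
            # through object.__setattr__.
            object.__setattr__(self, name, value.astimezone(timezone.utc))
        if not self.since < self.until:
            raise ValueError("since must be strictly before until")
        if not isinstance(self.source, str) or not self.source.strip():
            raise ValueError("source must be a non-empty string")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")

    def to_dict(self) -> dict[str, Any]:
        # Key names other than _schema_version are assumed for illustration.
        return {
            "_schema_version": SCHEMA_VERSION,
            "since": self.since.isoformat(),
            "until": self.until.isoformat(),
            "source": self.source,
            "confidence": self.confidence,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> IncidentWindowSketch:
        if data.get("_schema_version") != SCHEMA_VERSION:
            raise ValueError("unsupported schema version")
        # Going through the constructor re-runs every __post_init__ check,
        # so a dict that violates the invariants is rejected on load.
        return cls(
            since=datetime.fromisoformat(data["since"]),
            until=datetime.fromisoformat(data["until"]),
            source=data["source"],
            confidence=float(data["confidence"]),
        )
```

Routing from_dict through the constructor is one way to get the "rejection of dicts that would violate invariants" behaviour without duplicating the checks.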

What's not in this PR (deferred to follow-ups):

  • No tool reads from state.incident_window yet. PR 2 wires the first
    tool (GitDeployTimelineTool) to use it.
  • No adaptive expansion / narrowing logic. PR 3 adds that.

Tests (66 new, all pass):

  • Construction validation (naive tz rejected, inverted rejected,
    confidence range, non-UTC normalised, empty source rejected,
    non-datetime rejected).
  • Round-trip serialisation including schema version and rejection
    of dicts that would violate __post_init__ invariants.
  • Resolver precedence (override wins, default fallback, raw_alert
    None / malformed JSON).
  • One verbatim webhook payload per format (Alertmanager v4, Grafana
    managed alert, PagerDuty v3, Datadog event_time ms, CloudWatch
    SNS-wrapped alarm).
  • Parser shape variants (multiple AM alerts -> earliest wins,
    top-level startsAt, Datadog epoch seconds vs ms vs ISO, PagerDuty v2
    nesting, CloudWatch top-level StateUpdatedTimestamp).
  • Edge cases (clock skew, MAX clamp, zero/negative lookback,
    negative buffer, naive ISO string, non-dict raw_alert, garbage
    list entries, bool event_time not treated as epoch, malformed
    nested SNS Message).
  • Property fuzz: 25 garbage inputs all return a valid IncidentWindow
    without raising.
  • Targeted regression: 200-level deep CloudWatch payload does not
    blow the Python recursion limit.
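For readers unfamiliar with the depth cap exercised by the last test, one way to unwrap SNS-style Message nesting without recursion is sketched below. The function name and the constant MAX_UNWRAP_DEPTH are hypothetical; only the 4-level cap and the StateUpdatedTimestamp and Message field names come from the description above.

```python
from __future__ import annotations

import json
from typing import Any

MAX_UNWRAP_DEPTH = 4  # the cap quoted in the PR description


def find_state_updated_timestamp(payload: Any) -> str | None:
    """Walk "Message" wrappers iteratively and give up after MAX_UNWRAP_DEPTH levels."""
    current = payload
    for _ in range(MAX_UNWRAP_DEPTH):
        if isinstance(current, str):
            # SNS delivers the alarm as a JSON string inside "Message".
            try:
                current = json.loads(current)
            except ValueError:
                return None
        if not isinstance(current, dict):
            return None
        ts = current.get("StateUpdatedTimestamp")
        if isinstance(ts, str):
            return ts
        current = current.get("Message")
    # Deeper nesting than the cap is treated as "no anchor found".
    return None
```

Because the walk is a bounded loop rather than a recursive call, a 200-level payload costs four iterations and then returns None instead of approaching the interpreter's recursion limit.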

@greptile-apps
Contributor

greptile-apps Bot commented Apr 25, 2026

Greptile Summary

This PR introduces the IncidentWindow foundation — a frozen dataclass, five alert-format anchor parsers, and a resolve_incident_window resolver — so that time-aware tools can share a single, alert-anchored time window instead of each defaulting to "last 60 minutes from wall clock." State fields are added to both AgentState and AgentStateModel, and the extract_alert node is wired to populate incident_window on every run.

  • P1 — string raw_alert payloads silently fall back to default: _enrich_raw_alert discards the content of any non-dict raw_alert (replacing a string with {}), so resolve_incident_window(enriched_alert) never sees timestamps from JSON-string webhook payloads. The fix is to pass the original raw_alert (before enrichment) which _coerce_alert_dict already handles correctly.

Confidence Score: 4/5

One P1 defect — string-form webhook payloads silently fall back to the default window — should be fixed before merging to avoid defeating the feature for a common input shape.

The foundation logic in incident_window.py is sound and well-tested. The P1 bug is limited to extract_node.py line 143 where enriched_alert is passed instead of the pre-enrichment raw_alert; the one-line fix is straightforward. No data loss or security risk, but the core feature silently no-ops for string payloads, which undermines the PR's stated goal.

app/nodes/extract_alert/extract_node.py — the resolve_incident_window call at line 143 needs to receive the original raw_alert rather than enriched_alert.
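The effect of the call order can be reproduced with two small stand-ins. Both functions below are hypothetical simplifications written for this illustration; they only mimic the behaviour described above for _enrich_raw_alert (non-dict input becomes {}) and _coerce_alert_dict (JSON strings are decoded), and are not the project's implementations.

```python
from __future__ import annotations

import json
from typing import Any


def enrich_stand_in(raw_alert: Any) -> dict[str, Any]:
    """Hypothetical stand-in: any non-dict input is replaced by an empty dict."""
    return dict(raw_alert) if isinstance(raw_alert, dict) else {}


def coerce_stand_in(raw_alert: Any) -> dict[str, Any]:
    """Hypothetical stand-in: dicts pass through, JSON strings are decoded."""
    if isinstance(raw_alert, dict):
        return raw_alert
    if isinstance(raw_alert, str):
        try:
            decoded = json.loads(raw_alert)
        except ValueError:
            return {}
        return decoded if isinstance(decoded, dict) else {}
    return {}


# A string-form webhook payload, as a caller might deliver it.
raw_alert = json.dumps({"alerts": [{"startsAt": "2026-04-25T10:00:00Z"}]})

# Current order: the string is discarded before the resolver can decode it.
lost = coerce_stand_in(enrich_stand_in(raw_alert))

# Suggested order: hand the resolver the pre-enrichment value.
kept = coerce_stand_in(raw_alert)
```

With these stand-ins, lost is an empty dict while kept still carries the startsAt timestamp, which is the difference between a default window and an anchored one.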

Important Files Changed

| Filename | Overview |
| --- | --- |
| app/incident_window.py | New foundation module: frozen dataclass, five anchor parsers, and resolver. Well-structured with defensive error handling and good test coverage. Minor issue: _grafana_anchor duplicates _alertmanager_anchor and is unreachable in the parser chain. |
| app/nodes/extract_alert/extract_node.py | Wires resolve_incident_window into the extract step. P1 bug: passes enriched_alert (which converts any string raw_alert to {}) instead of the original raw_alert, silently dropping all timestamps for string-form webhook payloads. |
| app/state/agent_state.py | Adds `incident_window: dict[str, Any] |
| tests/app/test_incident_window.py | 66 tests covering construction validation, round-trip serialisation, resolver precedence, real-world payloads, shape variants, and edge cases. Comprehensive and well-organised. Does not test the extract_node integration path with string raw_alert. |
| tests/app/test_incident_window_cloudwatch_depth.py | Targeted regression for the CloudWatch 200-level depth cap. Verifies both legitimate 2-level SNS nesting and pathological recursion protection. Clean and focused. |

Sequence Diagram

sequenceDiagram
    participant W as Webhook / Caller
    participant N as node_extract_alert
    participant E as _enrich_raw_alert
    participant R as resolve_incident_window
    participant S as AgentState

    W->>N: raw_alert (dict or JSON string)
    N->>E: raw_alert
    Note over E: If raw_alert is a string,<br/>it becomes {} (content lost)
    E-->>N: enriched_alert (always dict)
    N->>R: enriched_alert  ← P1: string payloads arrive empty here
    Note over R: _coerce_alert_dict()<br/>_extract_anchor() → first matching parser wins
    alt anchor found
        R-->>N: IncidentWindow(source=label, confidence=1.0)
    else no anchor
        R-->>N: IncidentWindow(source=default, confidence=0.0)
    end
    N->>S: incident_window = window.to_dict()
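The resolver branch in the diagram, together with the clamping rules from the PR description, might look roughly like the sketch below. Several points are assumptions made for illustration only: the function takes an already-parsed anchor instead of a raw alert, it returns a plain dict, the window is assumed to run from (anchor minus lookback) to now, a future anchor is assumed to be clamped to now, a non-positive lookback is assumed to fall back to 60 minutes, and forward_buffer_minutes is left out because the description does not say how it is applied.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Any

MAX_LOOKBACK_MINUTES = 7 * 24 * 60  # 7 days, as stated in the PR description
DEFAULT_LOOKBACK_MINUTES = 60       # the "last 60 minutes" default


def resolve_window_sketch(
    anchor: datetime | None,
    *,
    override: dict[str, Any] | None = None,
    lookback_minutes: int = DEFAULT_LOOKBACK_MINUTES,
    now: datetime | None = None,
) -> dict[str, Any]:
    """Hypothetical resolver: override wins, then anchor, then default fallback."""
    if override is not None:
        return override
    # An injectable clock keeps tests deterministic.
    now = now if now is not None else datetime.now(timezone.utc)
    # Assumed handling: zero or negative lookback falls back to the default.
    if lookback_minutes <= 0:
        lookback_minutes = DEFAULT_LOOKBACK_MINUTES
    lookback = timedelta(minutes=min(lookback_minutes, MAX_LOOKBACK_MINUTES))
    if anchor is None:
        return {"since": now - lookback, "until": now,
                "source": "default", "confidence": 0.0}
    # Assumed clock-skew protection: an anchor in the future is pulled back to now.
    anchor = min(anchor, now)
    return {"since": anchor - lookback, "until": now,
            "source": "anchor", "confidence": 1.0}
```

Under these assumptions, an alert that started three hours before now yields a window beginning four hours back, which covers the slow-burn case that a wall-clock "last 60 minutes" query would miss.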

Reviews (1): Last reviewed commit: "feat: add IncidentWindow foundation for ..."

Comment thread app/nodes/extract_alert/extract_node.py Outdated
Comment thread app/incident_window.py Outdated
hamzzaaamalik merged commit fb7f7c2 into Tracer-Cloud:main on Apr 27, 2026
7 checks passed