feat: add IncidentWindow foundation for shared incident time by hamzzaaamalik · Pull Request #951 · Tracer-Cloud/opensre

hamzzaaamalik · 2026-04-25T16:15:06Z

What was needed:
Time-aware tools each independently default to "last 60 minutes"
counted from the agent's wall clock, not from when the alert
actually started. Slow-burn incidents (alert fires 3h after the
underlying problem began) get queried in the wrong time window
entirely. Different tools also disagree on what window they're
asking about. There is no shared "incident time" anywhere in
AgentState today.
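
To make the mismatch concrete, here is a minimal sketch, assuming an alert whose start timestamp lies three hours before the agent's clock. The function names and the 60-minute lookback applied to the anchor are illustrative assumptions, not code from this PR.

```python
from datetime import datetime, timedelta, timezone


def wall_clock_window(now, lookback_minutes=60):
    """Window a tool computes today: the last N minutes before `now`."""
    return now - timedelta(minutes=lookback_minutes), now


def anchored_window(alert_started_at, now, lookback_minutes=60):
    """Hypothetical alert-anchored window: look back from the alert's start."""
    return alert_started_at - timedelta(minutes=lookback_minutes), now


def contains(window, moment):
    """True if `moment` falls inside the (since, until) pair."""
    since, until = window
    return since <= moment <= until


now = datetime(2026, 4, 25, 16, 0, tzinfo=timezone.utc)
alert_started_at = now - timedelta(hours=3)  # assumed slow-burn scenario

wall = wall_clock_window(now)                      # 15:00 to 16:00 UTC
anchored = anchored_window(alert_started_at, now)  # 12:00 to 16:00 UTC
```

Under these assumptions the wall-clock window never covers the moment the alert started, while the anchored one does.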

What this PR does:
Pure foundation. No tool behavior changes yet.

New app/incident_window.py with:
- IncidentWindow frozen dataclass with __post_init__ validation
  (UTC normalisation, since < until, 0 <= confidence <= 1,
  rejects naive datetimes, rejects empty source).
- to_dict / from_dict serialisation with _schema_version=1.
- Five anchor parsers: Alertmanager, Grafana, PagerDuty (v3 and
  v2 shapes), Datadog (epoch ms / s / ISO), and CloudWatch
  (top-level and SNS-wrapped Message, depth-capped at 4 levels
  to prevent stack overflow on pathologically nested payloads).
  Each parser is wrapped in try/except so a misbehaving parser
  cannot crash the pipeline.
- resolve_incident_window(raw_alert, *, override, lookback_minutes,
  forward_buffer_minutes, now) with override-always-wins
  precedence, anchor lookup, default fallback, clock-skew
  protection, lookback clamped to MAX_LOOKBACK_MINUTES (7d),
  defensive handling of zero/negative lookback, injectable now
  for deterministic tests.
- Structured INFO log on each anchored resolution and DEBUG log
  on each default fallback for production debuggability.
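
The validation rules listed above could look roughly like the following. This is a minimal sketch, not the module's actual code: the field names follow the description (since, until, source, confidence), but the dict keys other than `_schema_version`, the error types, and the messages are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class IncidentWindow:
    since: datetime
    until: datetime
    source: str
    confidence: float

    def __post_init__(self) -> None:
        for name in ("since", "until"):
            value = getattr(self, name)
            if not isinstance(value, datetime):
                raise TypeError(f"{name} must be a datetime")
            if value.tzinfo is None or value.utcoffset() is None:
                raise ValueError(f"{name} must be timezone-aware")
            # Frozen dataclasses block normal assignment, so normalise
            # to UTC through object.__setattr__.
            object.__setattr__(self, name, value.astimezone(timezone.utc))
        if not self.since < self.until:
            raise ValueError("since must be earlier than until")
        if not isinstance(self.source, str) or not self.source.strip():
            raise ValueError("source must be a non-empty string")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")

    def to_dict(self) -> dict[str, Any]:
        # Key names besides _schema_version are assumed for illustration.
        return {
            "_schema_version": 1,
            "since": self.since.isoformat(),
            "until": self.until.isoformat(),
            "source": self.source,
            "confidence": self.confidence,
        }

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> IncidentWindow:
        if data.get("_schema_version") != 1:
            raise ValueError("unsupported schema version")
        # Going through the constructor re-runs every invariant check.
        return cls(
            since=datetime.fromisoformat(data["since"]),
            until=datetime.fromisoformat(data["until"]),
            source=data["source"],
            confidence=data["confidence"],
        )
```

Routing from_dict through the constructor is one way to get the "rejects dicts that violate the invariants" behaviour the tests describe without duplicating the checks.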

AgentState (TypedDict) and AgentStateModel (Pydantic) both gain
incident_window: dict | None. Drift test still passes.

extract_alert/extract_node.py now calls resolve_incident_window
on the enriched raw alert and stores the result via to_dict() in
state.incident_window. Existing extract_alert behavior unchanged
(only adds a new key to the result dict).
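
The resolver's precedence rules can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: it takes an already-parsed anchor instead of raw_alert, omits forward_buffer_minutes, returns a plain dict, and guesses at how clock skew and non-positive lookbacks are handled (clamp a future anchor to now; fall back to a 60-minute default).

```python
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK_MINUTES = 7 * 24 * 60  # 7 days, as stated in the description
DEFAULT_LOOKBACK_MINUTES = 60       # assumed default


def resolve_window_sketch(anchor, *, override=None, lookback_minutes=60, now=None):
    """Hypothetical stand-in for resolve_incident_window."""
    # Injectable clock keeps tests deterministic.
    now = now if now is not None else datetime.now(timezone.utc)

    # 1. An explicit override always wins.
    if override is not None:
        return override

    # 2. Zero or negative lookbacks fall back to the default (assumption),
    #    and oversized ones are clamped to the 7-day maximum.
    if lookback_minutes <= 0:
        lookback_minutes = DEFAULT_LOOKBACK_MINUTES
    lookback = timedelta(minutes=min(lookback_minutes, MAX_LOOKBACK_MINUTES))

    # 3. No anchor in the payload: default wall-clock window.
    if anchor is None:
        return {"since": now - lookback, "until": now,
                "source": "default", "confidence": 0.0}

    # 4. Clock-skew protection (assumption): an anchor in the future is
    #    treated as "now" so the window never ends before it starts.
    anchor = min(anchor, now)
    return {"since": anchor - lookback, "until": now,
            "source": "anchor", "confidence": 1.0}
```

For example, with now fixed at 16:00 UTC and an anchor at 13:00 UTC, the sketch returns a window starting at 12:00 UTC with confidence 1.0.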

What's not in this PR (deferred to follow-ups):

No tool reads from state.incident_window yet. PR 2 wires the first
tool (GitDeployTimelineTool) to use it.
No adaptive expansion / narrowing logic. PR 3 adds that.

Tests (66 new, all pass):

Construction validation (naive tz rejected, inverted rejected,
confidence range, non-UTC normalised, empty source rejected,
non-datetime rejected).
Round-trip serialisation including schema version and rejection
of dicts that would violate __post_init__ invariants.
Resolver precedence (override wins, default fallback, raw_alert
None / malformed JSON).
One verbatim webhook payload per format (Alertmanager v4, Grafana
managed alert, PagerDuty v3, Datadog event_time ms, CloudWatch
SNS-wrapped alarm).
Parser shape variants (multiple AM alerts -> earliest wins,
top-level startsAt, Datadog epoch seconds vs ms vs ISO, PagerDuty
v2 nesting, CloudWatch top-level StateUpdatedTimestamp).
Edge cases (clock skew, MAX clamp, zero/negative lookback,
negative buffer, naive ISO string, non-dict raw_alert, garbage
list entries, bool event_time not treated as epoch, malformed
nested SNS Message).
Property fuzz: 25 garbage inputs all return a valid IncidentWindow
without raising.
Targeted regression: 200-level deep CloudWatch payload does not
blow the Python recursion limit.
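
A depth cap of this kind is easiest to guarantee with a loop rather than recursion. The sketch below shows one way to unwrap SNS-style Message nesting under that cap; the function name, the constant name, and the choice to return None once the cap is hit are assumptions, and only the Message and StateUpdatedTimestamp keys come from the description above.

```python
from __future__ import annotations

import json
from typing import Any

MAX_UNWRAP_DEPTH = 4  # cap stated in the PR description


def unwrap_sns_message(payload: Any, max_depth: int = MAX_UNWRAP_DEPTH) -> dict[str, Any] | None:
    """Return the first dict carrying StateUpdatedTimestamp, or None.

    Inspects at most `max_depth` levels and never recurses, so a
    pathologically nested payload cannot exhaust the call stack.
    """
    current = payload
    for _ in range(max_depth):
        if not isinstance(current, dict):
            return None
        if "StateUpdatedTimestamp" in current:
            return current
        message = current.get("Message")
        if isinstance(message, str):
            # SNS delivers the alarm as a JSON-encoded string.
            try:
                message = json.loads(message)
            except ValueError:
                return None
        current = message
    return None
```

Because the loop bound is fixed, a 200-level payload costs the same four iterations as a legitimate two-level one.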

greptile-apps · 2026-04-25T16:18:45Z

Greptile Summary

This PR introduces the IncidentWindow foundation — a frozen dataclass, five alert-format anchor parsers, and a resolve_incident_window resolver — so that time-aware tools can share a single, alert-anchored time window instead of each defaulting to "last 60 minutes from wall clock." State fields are added to both AgentState and AgentStateModel, and the extract_alert node is wired to populate incident_window on every run.

P1 — string raw_alert payloads silently fall back to default: _enrich_raw_alert discards the content of any non-dict raw_alert (replacing a string with {}), so resolve_incident_window(enriched_alert) never sees timestamps from JSON-string webhook payloads. The fix is to pass the original raw_alert (before enrichment) which _coerce_alert_dict already handles correctly.

Confidence Score: 4/5

One P1 defect — string-form webhook payloads silently fall back to the default window — should be fixed before merging to avoid defeating the feature for a common input shape.

The foundation logic in incident_window.py is sound and well-tested. The P1 bug is limited to extract_node.py line 143 where enriched_alert is passed instead of the pre-enrichment raw_alert; the one-line fix is straightforward. No data loss or security risk, but the core feature silently no-ops for string payloads, which undermines the PR's stated goal.

app/nodes/extract_alert/extract_node.py — the resolve_incident_window call at line 143 needs to receive the original raw_alert rather than enriched_alert.
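
The failure mode can be reproduced in isolation. The two helpers below are hypothetical stand-ins that mimic only the behaviour described in this review (enrichment replaces a non-dict with {}, coercion parses JSON strings); they are not the real _enrich_raw_alert or _coerce_alert_dict.

```python
import json


def enrich_sketch(raw_alert):
    """Stand-in for enrichment: any non-dict input becomes an empty dict."""
    return dict(raw_alert) if isinstance(raw_alert, dict) else {}


def coerce_sketch(raw_alert):
    """Stand-in for coercion: dicts pass through, JSON strings are parsed."""
    if isinstance(raw_alert, dict):
        return raw_alert
    if isinstance(raw_alert, str):
        try:
            parsed = json.loads(raw_alert)
        except ValueError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


payload = json.dumps({"startsAt": "2026-04-25T13:00:00Z"})

# Current order: enrich first, then coerce. The timestamp is already gone.
after_enrichment = coerce_sketch(enrich_sketch(payload))

# Suggested order: coerce the original payload. The timestamp survives.
from_original = coerce_sketch(payload)
```

In this sketch `after_enrichment` is an empty dict while `from_original` still carries startsAt, which is why the resolver should receive the pre-enrichment value.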

Important Files Changed

Filename	Overview
app/incident_window.py	New foundation module: frozen dataclass, five anchor parsers, and resolver. Well-structured with defensive error handling and good test coverage. Minor issue: `_grafana_anchor` duplicates `_alertmanager_anchor` and is unreachable in the parser chain.
app/nodes/extract_alert/extract_node.py	Wires `resolve_incident_window` into the extract step. P1 bug: passes `enriched_alert` (which converts any string `raw_alert` to `{}`) instead of the original `raw_alert`, silently dropping all timestamps for string-form webhook payloads.
app/state/agent_state.py	Adds `incident_window: dict[str, Any]
tests/app/test_incident_window.py	66 tests covering construction validation, round-trip serialisation, resolver precedence, real-world payloads, shape variants, and edge cases. Comprehensive and well-organised. Does not test the `extract_node` integration path with string `raw_alert`.
tests/app/test_incident_window_cloudwatch_depth.py	Targeted regression for the CloudWatch 200-level depth cap. Verifies both legitimate 2-level SNS nesting and pathological recursion protection. Clean and focused.

Sequence Diagram

sequenceDiagram
    participant W as Webhook / Caller
    participant N as node_extract_alert
    participant E as _enrich_raw_alert
    participant R as resolve_incident_window
    participant S as AgentState

    W->>N: raw_alert (dict or JSON string)
    N->>E: raw_alert
    Note over E: If raw_alert is a string,<br/>it becomes {} (content lost)
    E-->>N: enriched_alert (always dict)
    N->>R: enriched_alert  ← P1: string payloads arrive empty here
    Note over R: _coerce_alert_dict()<br/>_extract_anchor() → first matching parser wins
    alt anchor found
        R-->>N: IncidentWindow(source=label, confidence=1.0)
    else no anchor
        R-->>N: IncidentWindow(source=default, confidence=0.0)
    end
    N->>S: incident_window = window.to_dict()


feat: add IncidentWindow foundation for shared incident time

1dd96a2

greptile-apps Bot reviewed Apr 25, 2026


Comment thread app/nodes/extract_alert/extract_node.py Outdated

Comment thread app/incident_window.py Outdated

fix(incident_window): typecheck + format CI failures

2ef8112

hamzzaaamalik merged commit fb7f7c2 into Tracer-Cloud:main Apr 27, 2026
7 checks passed

hamzzaaamalik mentioned this pull request Apr 29, 2026

Adaptive incident window: expand-on-empty-deploy-timeline #1074

Merged

hamzzaaamalik merged 2 commits into Tracer-Cloud:main from hamzzaaamalik:incident-window-foundation