feat: add IncidentWindow foundation for shared incident time#951
Greptile Summary
This PR introduces the
Confidence Score: 4/5
One P1 defect (string-form webhook payloads silently fall back to the default window) should be fixed before merging to avoid defeating the feature for a common input shape. The foundation logic in
Important Files Changed
Sequence Diagram
sequenceDiagram
participant W as Webhook / Caller
participant N as node_extract_alert
participant E as _enrich_raw_alert
participant R as resolve_incident_window
participant S as AgentState
W->>N: raw_alert (dict or JSON string)
N->>E: raw_alert
Note over E: If raw_alert is a string,<br/>it becomes {} (content lost)
E-->>N: enriched_alert (always dict)
N->>R: enriched_alert ← P1: string payloads arrive empty here
Note over R: _coerce_alert_dict()<br/>_extract_anchor() → first matching parser wins
alt anchor found
R-->>N: IncidentWindow(source=label, confidence=1.0)
else no anchor
R-->>N: IncidentWindow(source=default, confidence=0.0)
end
N->>S: incident_window = window.to_dict()
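The P1 path in the diagram (a string payload reaching the resolver as an empty dict) could be avoided by decoding JSON strings before enrichment. The following is a minimal sketch only: `coerce_alert_dict` is a hypothetical stand-in, and the PR's actual `_coerce_alert_dict` and `_enrich_raw_alert` implementations are not shown here.

```python
import json
from typing import Any


def coerce_alert_dict(raw_alert: Any) -> dict[str, Any]:
    """Return raw_alert as a dict, decoding JSON strings instead of dropping them.

    Hypothetical helper for illustration; not the PR's implementation.
    """
    if isinstance(raw_alert, dict):
        return raw_alert
    if isinstance(raw_alert, (str, bytes, bytearray)):
        try:
            decoded = json.loads(raw_alert)
        except ValueError:
            # Covers json.JSONDecodeError and UnicodeDecodeError.
            return {}
        # Only a JSON object is a usable alert; lists and scalars are not.
        return decoded if isinstance(decoded, dict) else {}
    return {}


alert = coerce_alert_dict('{"startsAt": "2024-01-01T12:00:00Z"}')
```

Running this coercion before enrichment would let the timestamp fields of a string-form webhook payload survive until anchor extraction.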
What was needed:
Time-aware tools each independently default to "last 60 minutes"
counted from the agent's wall clock, not from when the alert
actually started. Slow-burn incidents (alert fires 3h after the
underlying problem began) get queried in the wrong time window
entirely. Different tools also disagree on what window they're
asking about. There is no shared "incident time" anywhere in
AgentState today.
What this PR does:
Pure foundation. No tool behavior changes yet.
New app/incident_window.py with:
(UTC normalisation, since < until, 0 <= confidence <= 1,
rejects naive datetimes, rejects empty source).
and v2 shapes), Datadog (epoch ms / s / ISO), CloudWatch
(top-level and SNS-wrapped Message, depth-capped at 4 levels
to prevent stack overflow on pathologically nested payloads).
Each parser is wrapped in try/except so a misbehaving parser
cannot crash the pipeline.
forward_buffer_minutes, now) with override-always-wins
precedence, anchor lookup, default fallback, clock-skew
protection, lookback clamped to MAX_LOOKBACK_MINUTES (7d),
defensive handling of zero/negative lookback, injectable now
for deterministic tests.
on each default fallback for production debuggability.
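The invariants listed above (UTC normalisation, since < until, confidence in [0, 1], no naive datetimes, no empty source) can be sketched as a small dataclass. This is an illustration under assumptions, not the code in app/incident_window.py: the class name `IncidentWindowSketch`, the choice of a frozen dataclass, and the `to_dict()` key names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IncidentWindowSketch:
    since: datetime
    until: datetime
    source: str
    confidence: float

    def __post_init__(self) -> None:
        for name in ("since", "until"):
            value = getattr(self, name)
            if not isinstance(value, datetime):
                raise TypeError(f"{name} must be a datetime")
            if value.tzinfo is None or value.utcoffset() is None:
                raise ValueError(f"{name} must be timezone-aware")
            # Frozen dataclass: bypass __setattr__ to store the UTC value.
            object.__setattr__(self, name, value.astimezone(timezone.utc))
        if not self.since < self.until:
            raise ValueError("since must be earlier than until")
        if not isinstance(self.source, str) or not self.source.strip():
            raise ValueError("source must be a non-empty string")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")

    def to_dict(self) -> dict[str, object]:
        # Key names are assumptions; ISO strings keep the dict JSON-friendly.
        return {
            "since": self.since.isoformat(),
            "until": self.until.isoformat(),
            "source": self.source,
            "confidence": self.confidence,
        }


plus_two = timezone(timedelta(hours=2))
window = IncidentWindowSketch(
    since=datetime(2024, 1, 1, 12, 0, tzinfo=plus_two),
    until=datetime(2024, 1, 1, 13, 0, tzinfo=plus_two),
    source="default",
    confidence=0.0,
)
```

Validating in `__post_init__` means an invalid window cannot be constructed at all, so downstream consumers do not need to re-check these conditions.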
AgentState (TypedDict) and AgentStateModel (Pydantic) both gain
incident_window: dict | None. Drift test still passes.
extract_alert/extract_node.py now calls resolve_incident_window
on the enriched raw alert and stores the result via to_dict() in
state.incident_window. Existing extract_alert behavior unchanged
(only adds a new key to the result dict).
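The resolution order described above (override first, then anchor, then default fallback, with clamping) might look roughly like the sketch below. Only `MAX_LOOKBACK_MINUTES`, `forward_buffer_minutes`, and `now` come from the PR description; the function name, the other parameters, the 60-minute default, the source labels, and the exact clock-skew rule (a future anchor is pulled back to `now`) are assumptions for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_LOOKBACK_MINUTES = 7 * 24 * 60  # 7 days, as stated in the PR description
DEFAULT_LOOKBACK_MINUTES = 60  # assumption: mirrors the "last 60 minutes" default


def resolve_window_sketch(
    anchor: Optional[datetime],
    *,
    override: Optional[tuple[datetime, datetime]] = None,
    lookback_minutes: int = DEFAULT_LOOKBACK_MINUTES,
    forward_buffer_minutes: int = 0,
    now: Optional[datetime] = None,
) -> tuple[datetime, datetime, str]:
    """Return (since, until, source) for an incident. Hypothetical helper."""
    # Injectable clock keeps tests deterministic.
    now = now or datetime.now(timezone.utc)
    # An explicit override always wins.
    if override is not None:
        return override[0], override[1], "override"
    # Zero or negative lookback falls back to the default, then clamp to 7 days.
    if lookback_minutes <= 0:
        lookback_minutes = DEFAULT_LOOKBACK_MINUTES
    lookback = timedelta(minutes=min(lookback_minutes, MAX_LOOKBACK_MINUTES))
    buffer = timedelta(minutes=max(forward_buffer_minutes, 0))
    if anchor is None:
        return now - lookback, now, "default"
    # Clock-skew protection (assumed rule): never trust a future anchor.
    anchor = min(anchor, now)
    until = min(anchor + buffer, now)
    return anchor - lookback, until, "anchor"


fixed_now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
since, until, source = resolve_window_sketch(
    fixed_now - timedelta(hours=3),
    lookback_minutes=60,
    forward_buffer_minutes=30,
    now=fixed_now,
)
```

With an alert that started three hours before `fixed_now`, the window covers 08:00 to 09:30 UTC rather than the wall-clock hour of 11:00 to 12:00, which is the slow-burn case the PR is motivated by.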
What's not in this PR (deferred to follow-ups):
tool (GitDeployTimelineTool) to use it.
Tests (66 new, all pass):
confidence range, non-UTC normalised, empty source rejected,
non-datetime rejected).
of dicts that would violate __post_init__ invariants.
None / malformed JSON).
managed alert, PagerDuty v3, Datadog event_time ms, CloudWatch
SNS-wrapped alarm).
level startsAt, Datadog epoch seconds vs ms vs ISO, PagerDuty v2
nesting, CloudWatch top-level StateUpdatedTimestamp).
negative buffer, naive ISO string, non-dict raw_alert, garbage
list entries, bool event_time not treated as epoch, malformed
nested SNS Message).
without raising.
blow the Python recursion limit.
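Several of the cases above (Datadog epoch seconds vs ms vs ISO, bool event_time not treated as epoch, naive ISO string) come down to one timestamp-parsing step. The sketch below shows one way such a step could behave. The function name, the magnitude threshold used to tell milliseconds from seconds, and the choice to reject naive ISO strings are assumptions; the PR's parsers may decide these differently.

```python
from datetime import datetime, timezone
from typing import Any, Optional

# Assumption: epoch values at or above 1e11 are milliseconds
# (1e11 seconds is far in the future; 1e11 ms falls in 1973).
_MS_THRESHOLD = 1e11


def parse_timestamp_sketch(value: Any) -> Optional[datetime]:
    """Parse epoch s / epoch ms / ISO 8601 into an aware UTC datetime, or None."""
    # bool is a subclass of int, so reject it before the numeric branch.
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        seconds = value / 1000 if abs(value) >= _MS_THRESHOLD else value
        try:
            return datetime.fromtimestamp(seconds, tz=timezone.utc)
        except (OverflowError, OSError, ValueError):
            return None
    if isinstance(value, str):
        text = value.strip()
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        if text.endswith(("Z", "z")):
            text = text[:-1] + "+00:00"
        try:
            parsed = datetime.fromisoformat(text)
        except ValueError:
            return None
        if parsed.tzinfo is None:
            return None  # assumption: naive strings are rejected, not guessed
        return parsed.astimezone(timezone.utc)
    return None


expected = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
from_seconds = parse_timestamp_sketch(1704110400)
from_millis = parse_timestamp_sketch(1704110400000)
from_iso = parse_timestamp_sketch("2024-01-01T12:00:00Z")
```

Returning None instead of raising lets a caller fall through to the next candidate field or parser, in line with the description's goal that a misbehaving parser cannot crash the pipeline.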