
Adaptive incident window: expand-on-empty-deploy-timeline#1074

Merged
hamzzaaamalik merged 6 commits into Tracer-Cloud:main from hamzzaaamalik:incident-window-adaptive-expansion
Apr 29, 2026


@hamzzaaamalik
Collaborator

PR 3 of the dynamic incident-window work. Builds on #951 and #954 (both merged).

What it does

Inserts a new adapt_window node on the diagnose → plan_actions loop-back edge. When the loop continues, one rule runs:

Widen the incident window when the deploy timeline came back empty for a shared-window query.

Specifically: if get_git_deploy_timeline ran this iteration with window.source == "shared_incident_window" and returned 0 commits, double the lookback (clamped at 7 days) and record the old window in state.incident_window_history. Next iteration sees the wider window.

Capped at 2 expansions per investigation (e.g. 120m → 240m → 480m, then stops). Caller-explicit windows are never overridden. Terminal routing bypasses the node entirely.
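
The schedule above (double the lookback, clamp at 7 days, stop after two expansions) can be replayed with a few lines of arithmetic. This is a minimal sketch assuming minute-based lookbacks; `lookback_schedule` is a hypothetical helper for illustration, not code from this PR.

```python
# Hypothetical helper: replays the lookback schedule described in this PR.
MAX_EXPANSIONS = 2                    # per-investigation cap
EXPANSION_FACTOR = 2.0                # "double the lookback"
MAX_LOOKBACK_MINUTES = 7 * 24 * 60.0  # 7-day clamp (10080 minutes)


def lookback_schedule(initial_minutes: float, empty_iterations: int) -> list:
    """Lookback (minutes) seen by each iteration while the shared-window
    deploy timeline keeps returning 0 commits."""
    current = float(initial_minutes)
    schedule = [current]
    for _ in range(empty_iterations):
        if len(schedule) - 1 >= MAX_EXPANSIONS:
            break  # cap reached: no further widening
        widened = min(current * EXPANSION_FACTOR, MAX_LOOKBACK_MINUTES)
        if widened <= current:
            break  # already at the 7-day clamp
        current = widened
        schedule.append(current)
    return schedule


print(lookback_schedule(120, empty_iterations=5))   # [120.0, 240.0, 480.0]
print(lookback_schedule(6000, empty_iterations=5))  # [6000.0, 10080.0]
```

The two stop conditions are independent: the expansion cap bounds how many times the loop can widen, while the clamp check stops early when doubling would not change the width.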

Files

New:

  • app/nodes/adapt_window/{rules.py, node.py, __init__.py}
  • tests/nodes/adapt_window/{test_rules.py, test_node.py}
  • tests/pipeline/test_graph_adapt_window_wiring.py

Modified:

  • app/incident_window.py: IncidentWindow.expanded() helper
  • app/state/agent_state.py: incident_window_history field (drift test green)
  • app/investigation_constants.py: MAX_EXPANSIONS = 2
  • app/nodes/__init__.py, app/pipeline/graph.py: wiring

Tests

52 new tests, 1993 passing in the broad sweep, 0 regressions.

Not in this PR

Window contraction on deploy anchor (PR 4), LLM-driven adaptation, migrating other tools to the shared window, diagnose narrative changes.

Foundation for PR 3 (adaptive window). The adapt_window node will call
this when the deploy timeline came back empty for a shared-window query
to widen the lookback for the next investigation iteration.

Semantics:
  - until is preserved (anchor edge does not move)
  - since moves earlier so that the new lookback is factor × the current lookback
  - new lookback is clamped to MAX_LOOKBACK_MINUTES (7d)
  - source and confidence are preserved
  - factor must be > 1.0 (expansion only; contraction is a separate,
    deferred operation with different semantics)
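
The semantics listed above map naturally onto a frozen dataclass with a copy-returning method. The sketch below is an illustration under stated assumptions: the field names `since`, `until`, `source`, `confidence` and the `lookback_minutes` property are hypothetical stand-ins, not the actual `IncidentWindow` definition in app/incident_window.py.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK_MINUTES = 7 * 24 * 60.0  # 7-day clamp


@dataclass(frozen=True)
class IncidentWindow:
    # Assumed fields for illustration; the real class may store more.
    since: datetime
    until: datetime
    source: str
    confidence: float

    @property
    def lookback_minutes(self) -> float:
        return (self.until - self.since).total_seconds() / 60

    def expanded(self, factor: float = 2.0) -> "IncidentWindow":
        """Return a new window whose lookback is factor x the current one."""
        if factor <= 1.0:
            # Expansion only; contraction is a separate, deferred operation.
            raise ValueError(f"factor must be > 1.0, got {factor}")
        new_lookback = min(self.lookback_minutes * factor, MAX_LOOKBACK_MINUTES)
        # `until` is the anchor and stays put; only `since` moves earlier.
        # dataclasses.replace() returns a copy, so the frozen original is
        # untouched and source/confidence carry over unchanged.
        return replace(self, since=self.until - timedelta(minutes=new_lookback))


until = datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)
w = IncidentWindow(until - timedelta(minutes=120), until, "shared_incident_window", 0.8)
print(w.expanded().lookback_minutes)  # 240.0
```

A window already at the 7-day cap comes back with the same width (as a new instance), which is what lets the rule layer detect "expansion would not widen" by comparing widths.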

The fact of expansion is recorded by the caller in
state.incident_window_history (added in the next commit), not on the
window object itself. Keeps the value object minimal and the audit
trail in one place.

Tests cover: default factor=2.0, custom factor, MAX_LOOKBACK_MINUTES
clamp, already-at-cap returns same width, returns new instance (frozen
dataclass not mutated), preserves until anchor, preserves source and
confidence, factor=1.0 and factor<1.0 rejected.

Append-only audit trail of windows replaced by the (incoming) adapt_window
node. Each entry is the OLD window dict at the moment of replacement, plus
``replaced_at`` (ISO-8601) and ``replaced_reason`` (e.g.
"expanded:empty_deploy_timeline"). The cap on entries lives in the
adapt_window rule layer (MAX_EXPANSIONS) — the field itself is unbounded
so future rules can record contractions, anchor refinements, etc.

None until the first replacement. Diagnose narratives may cite this in a
future PR to explain "we tried 120m, found no deploys, widened to 240m".

Added in both AgentState (TypedDict) and AgentStateModel (Pydantic) so
the existing drift test in test_agent_state_sync.py stays green.

No node writes to this field yet — wiring lands in commit 4.
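
A history entry as described above is the OLD window dict plus two audit keys. This minimal sketch assumes the window serializes to a plain dict with ISO `since`/`until` strings; `append_history` is a hypothetical helper name, and the injected `now_fn` mirrors how the PR's tests pin timestamps.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List, Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def append_history(
    history: Optional[List[Dict[str, Any]]],
    old_window: Dict[str, Any],
    reason: str,
    now_fn: Callable[[], datetime] = _utc_now,
) -> List[Dict[str, Any]]:
    """Return a NEW history list with the replaced window appended."""
    entry = {
        **old_window,                          # the OLD window, unchanged
        "replaced_at": now_fn().isoformat(),   # ISO-8601 timestamp
        "replaced_reason": reason,             # e.g. "expanded:empty_deploy_timeline"
    }
    # No in-place append: under LangGraph's default (replace) reducer the
    # node must hand back the full accumulated list, and state stays untouched.
    return [*(history or []), entry]


def fixed_now() -> datetime:
    return datetime(2026, 4, 29, 12, 0, tzinfo=timezone.utc)


old = {"since": "2026-04-29T10:00:00+00:00", "until": "2026-04-29T12:00:00+00:00",
       "source": "shared_incident_window"}
hist = append_history(None, old, "expanded:empty_deploy_timeline", now_fn=fixed_now)
print(hist[0]["replaced_at"])  # 2026-04-29T12:00:00+00:00
```

Returning the whole list rather than a single new entry is what makes the field behave as append-only without a custom reducer.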

Pure decision logic (no LangChain imports — testable in isolation) for
the upcoming adapt_window node. The rule widens state.incident_window
when the GitDeployTimelineTool came back empty for a shared-window
query, so the next investigation iteration looks further back.

Guard chain (rule no-ops unless every guard passes):

  1. state.incident_window is well-formed (IncidentWindow.from_dict).
  2. state.incident_window_history has fewer than MAX_EXPANSIONS (= 2)
     entries — bounded so a pathological loop cannot run away.
  3. STALE-SIGNAL GUARD: the deploy timeline action must be in the most
     recent state.executed_hypotheses[-1].actions. Without this, evidence
     from iteration 1 would re-fire the rule at the end of iteration 2
     even when the tool didn't run again (caught in plan review).
  4. evidence.git_deploy_timeline_window.source == "shared_incident_window".
     caller_explicit / tool_default / unset (== {}) all fail this guard, so
     the rule no-ops and a caller's explicit window is NEVER overridden.
  5. evidence.git_deploy_timeline_count == 0.
  6. Expansion would actually widen the window (i.e. not already at
     MAX_LOOKBACK_MINUTES — IncidentWindow.expanded() would return same
     width).

When fired, the rule emits a state delta with the new window plus the
OLD window appended to history with replaced_at + replaced_reason. Tests
inject now_fn for deterministic ISO timestamps.
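
The guard chain and the emitted delta can be sketched as one pure function over a plain state dict. This is a sketch, not the PR's rules.py: the action name `get_git_deploy_timeline` and the evidence keys come from this PR, but the window dict shape (ISO `since`/`until` strings) is an assumption, and guard 1 parses those strings directly instead of calling `IncidentWindow.from_dict`.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict, Mapping

MAX_EXPANSIONS = 2
EXPANSION_FACTOR = 2.0
MAX_LOOKBACK = timedelta(days=7)
DEPLOY_ACTION = "get_git_deploy_timeline"


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def adapt_incident_window(
    state: Mapping[str, Any], now_fn: Callable[[], datetime] = _utc_now
) -> Dict[str, Any]:
    """Return a state delta that widens the window, or {} (no-op)."""
    # 1. Window must be well-formed (ISO since/until assumed here).
    window = state.get("incident_window")
    try:
        since = datetime.fromisoformat(window["since"])
        until = datetime.fromisoformat(window["until"])
    except (TypeError, KeyError, ValueError):
        return {}
    # 2. Bounded: at most MAX_EXPANSIONS replacements per investigation.
    history = state.get("incident_window_history") or []
    if not isinstance(history, list) or len(history) >= MAX_EXPANSIONS:
        return {}
    # 3. Stale-signal guard: the deploy tool must be in the LATEST actions.
    try:
        actions = state.get("executed_hypotheses")[-1].get("actions")
    except (TypeError, IndexError, KeyError, AttributeError):
        return {}
    if not isinstance(actions, list) or DEPLOY_ACTION not in actions:
        return {}
    # 4. Only shared-window queries; explicit/default/unset windows no-op.
    evidence = state.get("evidence")
    if not isinstance(evidence, dict):
        return {}
    tool_window = evidence.get("git_deploy_timeline_window")
    if not isinstance(tool_window, dict):
        return {}
    if tool_window.get("source") != "shared_incident_window":
        return {}
    # 5. The timeline must have come back empty.
    if evidence.get("git_deploy_timeline_count") != 0:
        return {}
    # 6. Expansion must actually widen the window (not already at 7 days).
    old_lookback = until - since
    new_lookback = min(old_lookback * EXPANSION_FACTOR, MAX_LOOKBACK)
    if new_lookback <= old_lookback:
        return {}
    entry = {**window, "replaced_at": now_fn().isoformat(),
             "replaced_reason": "expanded:empty_deploy_timeline"}
    return {
        "incident_window": {**window, "since": (until - new_lookback).isoformat()},
        "incident_window_history": [*history, entry],
    }
```

Every guard short-circuits to `{}`, so malformed or partial state (including a completely empty dict early in the pipeline) is treated as "nothing to do" rather than raising.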

30 tests covering happy path, every guard short-circuit, the
stale-signal guard specifically, multi-iteration accumulation, the
MAX_EXPANSIONS cap, and defensive shape handling (malformed evidence,
malformed history, malformed executed_hypotheses, immutable state).

The node entry point lands in the next commit; this commit only adds
the pure logic and the constant.

Thin @traceable wrapper around adapt_incident_window from rules.py.
Mirrors the existing convention: node_X function in node.py, package
__init__.py re-exports the entry point, top-level app.nodes also
re-exports for graph builder convenience.

The node:
  - calls the pure rule with a copy of state (no mutation)
  - returns {} on no-op (LangGraph's "no state change" signal)
  - returns the rule's state delta when an expansion fires
  - logs at INFO + debug_print when an expansion happens so operators
    can audit which run widened the window and why
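
The wrapper contract above (copy in, `{}` on no-op, pass the delta through, log on fire, ignore `config`) can be sketched without LangChain. Assumptions: the rule is injected through a hypothetical `build_adapt_window_node` factory so the block runs standalone (the real node.py imports the rule directly), the `@traceable` decorator and `debug_print` call are omitted, and the logger name is made up.

```python
import logging
from typing import Any, Callable, Dict, Mapping, Optional

logger = logging.getLogger("adapt_window_sketch")  # hypothetical logger name

Rule = Callable[[Dict[str, Any]], Dict[str, Any]]


def build_adapt_window_node(rule: Rule) -> Callable[..., Dict[str, Any]]:
    """Bind a rule into a LangGraph-style node callable (state, config)."""

    def node_adapt_window(
        state: Mapping[str, Any], config: Optional[Mapping[str, Any]] = None
    ) -> Dict[str, Any]:
        # `config` (a RunnableConfig in LangGraph) is accepted and ignored.
        delta = rule(dict(state))  # shallow copy: the caller's mapping is never passed
        if not delta:
            return {}  # empty dict = "no state change"
        history = delta.get("incident_window_history") or []
        logger.info(
            "adapt_window: widened incident window to since=%s (%d expansion(s) so far)",
            (delta.get("incident_window") or {}).get("since"),
            len(history),
        )
        return delta  # pass the rule's delta through untouched

    return node_adapt_window


noop_node = build_adapt_window_node(lambda s: {})
print(noop_node({}))  # {}
```

Logging only on the firing path keeps no-op iterations silent, so an INFO line in the logs is itself evidence that a widening happened.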

Tests cover: state delta passthrough, no-op contract, completely empty
state (early pipeline), state immutability, INFO log on fire, no INFO
log on no-op, RunnableConfig kwarg accepted-and-ignored.

The graph wiring lands in the next commit; nothing currently calls this
node, so adding it does not affect any existing investigation.

Wire node_adapt_window into the LangGraph between diagnose and the
loop-back to plan_actions. Terminal routing decisions (opensre_eval,
publish) bypass it entirely — adaptation only runs when the
investigation is continuing.

Topology change:
  before: diagnose --[conditional "investigate"]--> plan_actions
  after:  diagnose --[conditional "investigate"]--> adapt_window
                                                      |
                                                      v
                                                   plan_actions

The string "investigate" returned by route_investigation_loop is
preserved (it means "loop again", not "go to the investigate node");
only the destination node mapping changes. No routing.py change needed.

6 graph-introspection smoke tests assert: node registered, loop edge
present, unconditional adapt_window->plan_actions edge present,
terminal paths bypass adapt_window, no direct diagnose->plan_actions
edge remains (regression guard), pre-loop edges unchanged.
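
The invariants those smoke tests check can be modeled in plain Python. In LangGraph this wiring would typically be expressed as `StateGraph.add_conditional_edges("diagnose", route_investigation_loop, path_map)` plus `add_edge("adapt_window", "plan_actions")`, but graph.py is not shown here, so the exact call and the terminal node names below are assumptions; `LOOP_PATH_MAP`, `STATIC_EDGES` and `successors` are hypothetical.

```python
# Hypothetical, dependency-free model of the loop wiring after this PR.
# Route label returned by route_investigation_loop -> destination node.
LOOP_PATH_MAP = {
    "investigate": "adapt_window",   # "loop again"; previously mapped to plan_actions
    "opensre_eval": "opensre_eval",  # terminal: bypasses adapt_window
    "publish": "publish",            # terminal: bypasses adapt_window
}
# Unconditional edges relevant to the loop.
STATIC_EDGES = {("adapt_window", "plan_actions")}


def successors(node: str) -> set:
    """All nodes reachable from `node` in one hop."""
    nxt = {dst for src, dst in STATIC_EDGES if src == node}
    if node == "diagnose":
        nxt |= set(LOOP_PATH_MAP.values())
    return nxt


print(sorted(successors("diagnose")))  # ['adapt_window', 'opensre_eval', 'publish']
print(successors("adapt_window"))      # {'plan_actions'}
```

Only the destination for the "investigate" label changes; the label itself, and therefore routing.py, stays the same.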
@greptile-apps
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR inserts a new adapt_window node on the diagnose → plan_actions loop-back edge. When looping continues, the node applies a pure rule that doubles the incident lookback window (capped at 7 days, max 2 expansions) when get_git_deploy_timeline ran in the most recent iteration against a shared window and returned 0 commits. Terminal paths bypass the node entirely.

The implementation is well-structured: the 8-guard rule chain is side-effect-free, the stale-signal guard (executed_hypotheses[-1].actions) correctly prevents re-firing across iterations when the deploy tool didn't run again, and the full-list replacement approach is correct for LangGraph's default reducer on incident_window_history.

Confidence Score: 5/5

Safe to merge — all remaining findings are P2 style/test quality nits with no impact on runtime behaviour.

Logic is sound, guard chain is exhaustive, state mutations are non-destructive, and 52 tests cover happy-path, each guard independently, multi-expansion, and defensive shapes. The two P2 findings (stale docstring, dead parametrized test cases) have zero effect on correctness or reliability.

No files require special attention.

Important Files Changed

  • app/nodes/adapt_window/rules.py: Pure rule function with a well-structured 8-guard chain. Stale-signal guard correctly prevents re-fire across iterations. History rebuilding returns the full list for LangGraph's replace reducer.
  • app/nodes/adapt_window/node.py: Thin LangGraph wrapper around the pure rule; logs INFO on expansion and returns the rule's delta directly. Decorated with @traceable; the dict(state) shallow copy is safe since the rule is read-only.
  • app/nodes/adapt_window/__init__.py: Package init; contains a stale docstring saying node.py is "added in the next commit" when it is already present in this PR.
  • app/pipeline/graph.py: Correctly re-routes the "investigate" conditional edge through adapt_window before plan_actions. Terminal paths bypass the node. Wiring tests confirm the contract.
  • app/state/agent_state.py: Adds incident_window_history to both AgentState and AgentStateModel, keeping them in sync. The default replace reducer is appropriate since the rule returns the full accumulated list.
  • app/incident_window.py: Adds expanded() method; correctly widens lookback by factor, clamps to MAX_LOOKBACK_MINUTES, preserves until/source/confidence, raises on factor ≤ 1.0.
  • tests/nodes/adapt_window/test_rules.py: Thorough guard coverage, stale-signal class, multi-expansion tests, and defensive shape tests. One parametrized test has dead cases for factor ≠ EXPANSION_FACTOR that assert nothing.
  • tests/nodes/adapt_window/test_node.py: Good coverage of the wrapper contract: delta pass-through, no-op path, empty-state safety, non-mutation, INFO-log assertions, config kwarg acceptance.
  • tests/pipeline/test_graph_adapt_window_wiring.py: Graph-introspection tests verify the conditional edge path map and the unconditional adapt_window → plan_actions edge. Includes a regression guard against a direct diagnose → plan_actions edge.

Sequence Diagram

sequenceDiagram
    participant D as diagnose
    participant R as route_investigation_loop
    participant AW as adapt_window
    participant PA as plan_actions
    participant E as opensre_eval / publish

    D->>R: conditional edge
    alt route == investigate
        R->>AW: loop-back path
        AW->>AW: adapt_incident_window(state)
        note over AW: guard chain: window present? history < MAX_EXPANSIONS? deploy timeline ran this iter? evidence present? source == shared_incident_window? commits_count == 0? expansion actually widens?
        alt all guards pass
            AW-->>PA: {incident_window: widened, incident_window_history: [...old]}
        else any guard fails
            AW-->>PA: {} (no state change)
        end
    else route == opensre_eval or publish
        R->>E: terminal path (bypasses adapt_window)
    end

Comments Outside Diff (1)

  1. tests/nodes/adapt_window/test_rules.py, lines 970-982

    P2 Dead parametrized test cases assert nothing

    For factor=3.0 and factor=4.0 the if factor == EXPANSION_FACTOR guard is always False, so those two parametrized cases execute zero assertions and pass trivially. They inflate the test count without providing any coverage.

    Since the actual intent is to pin the constant, a single direct assertion would be cleaner:

    def test_expansion_factor_is_two() -> None:
        """Pin EXPANSION_FACTOR so changes are caught immediately."""
        from app.nodes.adapt_window.rules import EXPANSION_FACTOR
        assert EXPANSION_FACTOR == 2.0

Reviews (1): Last reviewed commit: "feat(pipeline): route investigate loop t..."

Comment thread app/nodes/adapt_window/__init__.py
hamzzaaamalik merged commit 142a7b8 into Tracer-Cloud:main on Apr 29, 2026
7 checks passed
@github-actions
Contributor

🧑‍💻 @hamzzaaamalik has entered the contributor hall of fame. Merged. Done. Shipped. Go touch grass (then come back with another PR). 🌱


👋 Join us on Discord - OpenSRE : hang out, contribute, or hunt for features and issues. Everyone's welcome.
