Adaptive incident window: expand-on-empty-deploy-timeline #1074
Foundation for PR 3 (adaptive window). The adapt_window node will call
this when the deploy timeline came back empty for a shared-window query
to widen the lookback for the next investigation iteration.
Semantics:
- until is preserved (anchor edge does not move)
- since moves earlier so that the new lookback is factor × the current lookback
- new lookback is clamped to MAX_LOOKBACK_MINUTES (7d)
- source and confidence are preserved
- factor must be > 1.0 (expansion only; contraction is a separate,
deferred operation with different semantics)
The fact of expansion is recorded by the caller in
state.incident_window_history (added in the next commit), not on the
window object itself. Keeps the value object minimal and the audit
trail in one place.
Tests cover: default factor=2.0, custom factor, MAX_LOOKBACK_MINUTES
clamp, already-at-cap returns same width, returns new instance (frozen
dataclass not mutated), preserves until anchor, preserves source and
confidence, factor=1.0 and factor<1.0 rejected.
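
The semantics listed in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from app/incident_window.py: the field names and types (since/until as datetimes, source as a string, confidence as a float) are assumptions, and only the expanded() name, the factor default, and the 7-day MAX_LOOKBACK_MINUTES cap come from the text above.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone

MAX_LOOKBACK_MINUTES = 7 * 24 * 60  # 7 days, per the commit message


@dataclass(frozen=True)
class IncidentWindow:
    # Field names and types are assumed for illustration.
    since: datetime
    until: datetime
    source: str
    confidence: float

    def expanded(self, factor: float = 2.0) -> "IncidentWindow":
        """Return a new, wider window; the `until` anchor never moves."""
        if factor <= 1.0:
            raise ValueError("factor must be > 1.0 (expansion only)")
        lookback = self.until - self.since
        cap = timedelta(minutes=MAX_LOOKBACK_MINUTES)
        new_lookback = min(lookback * factor, cap)
        # replace() builds a fresh instance, so the frozen original is untouched
        # and source/confidence carry over unchanged.
        return replace(self, since=self.until - new_lookback)


until = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = IncidentWindow(
    since=until - timedelta(minutes=120),
    until=until,
    source="shared_incident_window",
    confidence=0.8,
)
wider = window.expanded()
print(wider.until - wider.since)  # 4:00:00
```

Returning a new instance rather than mutating keeps the value object safe to share between nodes, which matches the "frozen dataclass not mutated" test case listed above.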
Append-only audit trail of windows replaced by the (incoming) adapt_window node. Each entry is the OLD window dict at the moment of replacement, plus ``replaced_at`` (ISO-8601) and ``replaced_reason`` (e.g. "expanded:empty_deploy_timeline").

The cap on entries lives in the adapt_window rule layer (MAX_EXPANSIONS). The field itself is unbounded so future rules can record contractions, anchor refinements, etc. It is None until the first replacement.

Diagnose narratives may cite this in a future PR to explain "we tried 120m, found no deploys, widened to 240m".

Added in both AgentState (TypedDict) and AgentStateModel (Pydantic) so the existing drift test in test_agent_state_sync.py stays green. No node writes to this field yet; wiring lands in commit 4.
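
As a rough illustration of the entry shape described above, the sketch below builds one history entry and appends it without mutating the existing list. The helper name append_history and the window dict keys are hypothetical; only replaced_at, replaced_reason, the example reason string, and the "None until the first replacement" behaviour come from the commit message.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def append_history(
    history: Optional[list[dict[str, Any]]],
    old_window: dict[str, Any],
    reason: str,
    now: datetime,
) -> list[dict[str, Any]]:
    """Return a new history list with the replaced window appended (hypothetical helper)."""
    entry = {
        **old_window,                      # the OLD window dict, copied as-is
        "replaced_at": now.isoformat(),    # ISO-8601 timestamp
        "replaced_reason": reason,
    }
    # `history` is None until the first replacement, so treat None as empty.
    return [*(history or []), entry]


old = {"since": "2024-01-01T10:00:00+00:00", "until": "2024-01-01T12:00:00+00:00"}
history = append_history(
    None,
    old,
    "expanded:empty_deploy_timeline",
    datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
)
```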
Pure decision logic (no LangChain imports — testable in isolation) for
the upcoming adapt_window node. The rule widens state.incident_window
when the GitDeployTimelineTool came back empty for a shared-window
query, so the next investigation iteration looks further back.
Guard chain (rule no-ops unless every guard passes):
1. state.incident_window is well-formed (IncidentWindow.from_dict).
2. state.incident_window_history has fewer than MAX_EXPANSIONS (= 2)
entries — bounded so a pathological loop cannot run away.
3. STALE-SIGNAL GUARD: the deploy timeline action must be in the most
recent state.executed_hypotheses[-1].actions. Without this, evidence
from iteration 1 would re-fire the rule at the end of iteration 2
even when the tool didn't run again (caught in plan review).
4. evidence.git_deploy_timeline_window.source == "shared_incident_window".
caller_explicit / tool_default / unset (== {}) all fall through, so
a caller's explicit window is NEVER overridden.
5. evidence.git_deploy_timeline_count == 0.
6. Expansion would actually widen the window (i.e. not already at
MAX_LOOKBACK_MINUTES — IncidentWindow.expanded() would return same
width).
When fired, the rule emits a state delta with the new window plus the
OLD window appended to history with replaced_at + replaced_reason. Tests
inject now_fn for deterministic ISO timestamps.
30 tests covering happy path, every guard short-circuit, the
stale-signal guard specifically, multi-iteration accumulation, the
MAX_EXPANSIONS cap, and defensive shape handling (malformed evidence,
malformed history, malformed executed_hypotheses, immutable state).
The node entry point lands in the next commit; this commit only adds
the pure logic and the constant.
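
The guard chain above might look roughly like this when written as a pure function. It is a sketch under several assumptions: state is a plain dict, the window stores since/until as ISO-8601 strings, executed_hypotheses[-1]["actions"] is a list of action-name strings, and the rule returns None on a no-op. ISO parsing stands in for IncidentWindow.from_dict. The evidence keys, MAX_EXPANSIONS, the reason string, and the now_fn injection are taken from the commit message.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Optional

MAX_EXPANSIONS = 2
MAX_LOOKBACK = timedelta(days=7)
DEPLOY_ACTION = "get_git_deploy_timeline"  # assumed action name


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


def adapt_incident_window(
    state: dict[str, Any],
    now_fn: Callable[[], datetime] = _utc_now,
    factor: float = 2.0,
) -> Optional[dict[str, Any]]:
    """Return a state delta when every guard passes, else None (assumed no-op value)."""
    window = state.get("incident_window")
    # Guard 1: window is well-formed (stand-in for IncidentWindow.from_dict).
    try:
        since = datetime.fromisoformat(window["since"])
        until = datetime.fromisoformat(window["until"])
        lookback = until - since
    except (TypeError, KeyError, ValueError):
        return None
    # Guard 2: bounded number of expansions.
    history = state.get("incident_window_history") or []
    if not isinstance(history, list) or len(history) >= MAX_EXPANSIONS:
        return None
    # Guard 3 (stale signal): the deploy timeline must have run in the latest iteration.
    hypotheses = state.get("executed_hypotheses") or []
    if not isinstance(hypotheses, list) or not isinstance(hypotheses[-1], dict):
        return None
    if DEPLOY_ACTION not in (hypotheses[-1].get("actions") or []):
        return None
    # Guard 4: only shared-window queries; caller_explicit, tool_default, {} fall through.
    evidence = state.get("evidence") or {}
    if not isinstance(evidence, dict):
        return None
    used = evidence.get("git_deploy_timeline_window") or {}
    if not isinstance(used, dict) or used.get("source") != "shared_incident_window":
        return None
    # Guard 5: the timeline came back empty.
    if evidence.get("git_deploy_timeline_count") != 0:
        return None
    # Guard 6: expansion must actually widen the window.
    new_lookback = min(lookback * factor, MAX_LOOKBACK)
    if new_lookback <= lookback:
        return None
    entry = {
        **window,
        "replaced_at": now_fn().isoformat(),
        "replaced_reason": "expanded:empty_deploy_timeline",
    }
    new_window = {**window, "since": (until - new_lookback).isoformat()}
    return {"incident_window": new_window, "incident_window_history": [*history, entry]}
```

Because the function only reads state and builds new dicts, each guard can be tested by flipping a single key in an otherwise passing state.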
Thin @Traceable wrapper around adapt_incident_window from rules.py. Mirrors the existing convention: node_X function in node.py, package __init__.py re-exports the entry point, top-level app.nodes also re-exports for graph builder convenience.

The node:
- calls the pure rule with a copy of state (no mutation)
- returns {} on no-op (LangGraph's "no state change" signal)
- returns the rule's state delta when an expansion fires
- logs at INFO + debug_print when an expansion happens so operators can audit which run widened the window and why

Tests cover: state delta passthrough, no-op contract, completely empty state (early pipeline), state immutability, INFO log on fire, no INFO log on no-op, RunnableConfig kwarg accepted-and-ignored.

The graph wiring lands in the next commit; nothing currently calls this node, so adding it does not affect any existing investigation.
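
A minimal sketch of the wrapper contract described above, assuming the rule returns None (or an empty dict) on a no-op. The make_node factory, the logger name, and the log message are hypothetical and exist only to keep the example self-contained; the tracing decorator and debug_print are left out.

```python
import copy
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("adapt_window")  # assumed logger name

Rule = Callable[[dict[str, Any]], Optional[dict[str, Any]]]


def make_node(rule: Rule) -> Callable[..., dict[str, Any]]:
    """Wrap a pure rule in a node-shaped callable (hypothetical factory)."""

    def node_adapt_window(state: dict[str, Any], config: Any = None) -> dict[str, Any]:
        # The rule only ever sees a deep copy, so the caller's state is never mutated.
        delta = rule(copy.deepcopy(state))
        if not delta:
            return {}  # "no state change" signal
        logger.info("incident window expanded: %s", delta.get("incident_window"))
        return delta  # `config` is accepted and ignored

    return node_adapt_window


# Usage with stand-in rules:
noop_node = make_node(lambda state: None)
firing_node = make_node(lambda state: {"incident_window": {"since": "earlier"}})
```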
Wire node_adapt_window into the LangGraph between diagnose and the
loop-back to plan_actions. Terminal routing decisions (opensre_eval,
publish) bypass it entirely — adaptation only runs when the
investigation is continuing.
Topology change:
  before: diagnose --[conditional "investigate"]--> plan_actions
  after:  diagnose --[conditional "investigate"]--> adapt_window
                                                         |
                                                         v
                                                    plan_actions
The string "investigate" returned by route_investigation_loop is
preserved (it means "loop again", not "go to the investigate node");
only the destination node mapping changes. No routing.py change needed.
6 graph-introspection smoke tests assert: node registered, loop edge
present, unconditional adapt_window->plan_actions edge present,
terminal paths bypass adapt_window, no direct diagnose->plan_actions
edge remains (regression guard), pre-loop edges unchanged.
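
The topology change can be modelled in plain Python to show why routing.py stays untouched: the route strings are the same, and only the destination mapping for "investigate" differs. This is an illustrative model, not the LangGraph builder code in app/pipeline/graph.py; the dict names and the resolve_path helper are hypothetical.

```python
# Route string -> destination node, as seen from the diagnose conditional edge.
ROUTES_BEFORE = {
    "investigate": "plan_actions",
    "opensre_eval": "opensre_eval",
    "publish": "publish",
}
# After the change only the "investigate" destination moves.
ROUTES_AFTER = {**ROUTES_BEFORE, "investigate": "adapt_window"}
# New unconditional edge added by this commit.
EDGES_AFTER = {"adapt_window": "plan_actions"}


def resolve_path(route: str, routes: dict[str, str], edges: dict[str, str]) -> list[str]:
    """Follow the conditional destination, then any unconditional edges."""
    path = [routes[route]]
    while path[-1] in edges:
        path.append(edges[path[-1]])
    return path


loop_path = resolve_path("investigate", ROUTES_AFTER, EDGES_AFTER)
publish_path = resolve_path("publish", ROUTES_AFTER, EDGES_AFTER)
```

Here loop_path passes through adapt_window before plan_actions, while publish_path never touches it, which is the property the smoke tests assert on the real graph.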
Greptile Summary
This PR inserts a new … The implementation is well-structured: the 8-guard rule chain is side-effect-free, the stale-signal guard ( …

Confidence Score: 5/5
Safe to merge: all remaining findings are P2 style/test quality nits with no impact on runtime behaviour.

Logic is sound, guard chain is exhaustive, state mutations are non-destructive, and 52 tests cover happy-path, each guard independently, multi-expansion, and defensive shapes. The two P2 findings (stale docstring, dead parametrized test cases) have zero effect on correctness or reliability. No files require special attention.
Sequence Diagram
sequenceDiagram
participant D as diagnose
participant R as route_investigation_loop
participant AW as adapt_window
participant PA as plan_actions
participant E as opensre_eval / publish
D->>R: conditional edge
alt route == investigate
R->>AW: loop-back path
AW->>AW: adapt_incident_window(state)
note over AW: guard chain: window present? history < MAX_EXPANSIONS? deploy timeline ran this iter? evidence present? source == shared_incident_window? commits_count == 0? expansion actually widens?
alt all guards pass
AW-->>PA: {incident_window: widened, incident_window_history: [...old]}
else any guard fails
AW-->>PA: {} (no state change)
end
else route == opensre_eval or publish
R->>E: terminal path (bypasses adapt_window)
end
@hamzzaaamalik has entered the contributor hall of fame. Merged. Done. Shipped. Go touch grass (then come back with another PR). 🌱

PR 3 of the dynamic incident-window work. Builds on #951 and #954 (both merged).
What it does
Inserts a new adapt_window node on the diagnose → plan_actions loop-back edge. When the loop continues, one rule runs:

Widen the incident window when the deploy timeline came back empty for a shared-window query.

Specifically: if get_git_deploy_timeline ran this iteration with window.source == "shared_incident_window" and returned 0 commits, double the lookback (clamped at 7 days) and record the old window in state.incident_window_history. Next iteration sees the wider window.

Capped at 2 expansions per investigation (e.g. 120m → 240m → 480m, then stops). Caller-explicit windows are never overridden. Terminal routing bypasses the node entirely.
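
The 120m → 240m → 480m progression follows from the doubling factor, the 7-day clamp, and the two-expansion cap. A small worked sketch, using a hypothetical expansion_schedule helper that works in plain minutes:

```python
def expansion_schedule(
    start_minutes: float,
    factor: float = 2.0,
    max_expansions: int = 2,
    cap_minutes: float = 7 * 24 * 60,
) -> list[float]:
    """List the lookback widths an investigation could pass through (hypothetical helper)."""
    widths = [start_minutes]
    for _ in range(max_expansions):
        nxt = min(widths[-1] * factor, cap_minutes)
        if nxt <= widths[-1]:
            break  # already at the cap, so a further expansion would not widen anything
        widths.append(nxt)
    return widths


print(expansion_schedule(120))   # [120, 240.0, 480.0]
print(expansion_schedule(6000))  # [6000, 10080]
```

The second call shows the clamp: 6000 minutes would double to 12000, but it is held at 10080 (7 days), and the next expansion is skipped because it would not widen the window.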
Files
New:
- app/nodes/adapt_window/{rules.py, node.py, __init__.py}
- tests/nodes/adapt_window/{test_rules.py, test_node.py}
- tests/pipeline/test_graph_adapt_window_wiring.py
Modified:
- app/incident_window.py: IncidentWindow.expanded() helper
- app/state/agent_state.py: incident_window_history field (drift test green)
- app/investigation_constants.py: MAX_EXPANSIONS = 2
- app/nodes/__init__.py, app/pipeline/graph.py: wiring
Tests
52 new tests, 1993 passing in the broad sweep, 0 regressions.
Not in this PR
Window contraction on deploy anchor (PR 4), LLM-driven adaptation, migrating other tools to the shared window, diagnose narrative changes.