Skip to content

[aw] Failure Investigator (6h) #55

[aw] Failure Investigator (6h)

[aw] Failure Investigator (6h) #55

Agentic Workflow file for this run

---
description: Investigates [aw] failures from the last 6 hours, correlates with open agentic-workflows issues, closes fixed issues, and opens focused fix sub-issues when needed
on:
schedule:
- cron: "every 6h"
workflow_dispatch:
permissions:
contents: read
actions: read
issues: read
pull-requests: read
tracker-id: aw-failure-investigator
engine: claude
tools:
cli-proxy: true
agentic-workflows:
github:
toolsets: [default, actions]
bash: ["*"]
safe-outputs:
create-issue:
expires: 7d
title-prefix: "[aw-failures] "
labels: [agentic-workflows, automation, cookie]
max: 2
group: true
update-issue:
target: "*"
max: 10
link-sub-issue:
max: 10
noop:
timeout-minutes: 60
imports:
- shared/reporting.md
---
# [aw] Failure Investigator (6h)
Investigate agentic workflow failures from the last 6 hours and produce actionable issue tracking with sub-issues.
## Scope
- **Repository**: `${{ github.repository }}`
- **Lookback window**: last 6 hours
- **Issue query to inspect first**: <https://github.com/github/gh-aw/issues?q=is%3Aissue%20state%3Aopen%20label%3Aagentic-workflows>
## Mission
1. Find recent failures from agentic workflows in the last 6 hours.
2. Correlate findings with currently open `agentic-workflows` issues.
3. Perform large-scale failure analysis using logs + audit + audit-diff.
4. Close fixed/stale issues first, then create only the minimum necessary linked fix sub-issues.
## Required Investigation Steps
### 1) Fetch and review existing issue context
Find open issues with `agentic-workflows` label using GitHub issues search (equivalent to the URL query above). Focus on issues created/updated in the lookback window and unresolved recurring themes.
Capture:
- Existing failure clusters already tracked
- Gaps where recurring failures are not yet tracked
- Potential duplicates to avoid
### 2) Collect workflow runs and isolate failures (last 6h)
Use `agentic-workflows` MCP `logs` with a 6-hour window (for example `start_date: "-6h"`) and enough count to cover all recent runs.
Build a failure dataset including:
- run id, workflow, engine, status/conclusion
- timestamps and durations
- repeated failure signatures
- affected tools / MCP / firewall / auth / timeout dimensions
### 3) Deep-dive each failure cluster with `audit`
For each meaningful cluster (not every single run if many are equivalent), call `agentic-workflows` MCP `audit` on representative failed runs and at least one successful comparator run when available.
Extract evidence:
- root-cause signals
- dominant error messages
- tool failure patterns
- token/cost/runtime anomalies
- infra vs workflow-definition vs prompt/tooling failure classification
### 4) Compare behavior with `audit-diff`
Use `agentic-workflows` MCP `audit-diff` to compare:
- failed run vs nearest successful run of the same workflow, or
- failed run vs prior failed run to detect drift
Identify regressions and deltas (metrics/tooling/firewall/MCP behavior) that support fix recommendations.
### 5) Close fixed issues first, then add focused sub-issues
First, identify currently open `agentic-workflows` issues that are now fixed, stale, or no longer actionable based on fresh evidence, and close them using `update-issue`.
Then, if new uncovered work remains, add **sub-issues** for concrete fixes to the **most recent open parent report issue** instead of creating a new parent by default.
Only create a new parent report issue (temporary ID format `aw_` + 3-8 alphanumeric characters) when **P0 failures have no existing tracking coverage**.
Each new sub-issue must include:
- clear problem statement
- affected workflows and run IDs
- probable root cause
- specific proposed remediation
- success criteria / verification
## Output Requirements
### Parent report issue structure
Include these sections:
1. Executive summary
2. Failure clusters (table)
3. Evidence (logs/audit/audit-diff)
4. Existing issue correlation
5. Proposed fix roadmap (P0/P1/P2)
6. Sub-issues created
### Sub-issue quality bar
- Prefer a few high-quality, actionable sub-issues over many weak ones.
- Avoid duplicates of already-open issues unless new evidence materially changes scope.
- Reference the parent issue and the concrete run IDs analyzed.
## Decision Rules
- If there are **no failures** in the last 6h, or no actionable delta vs existing issues, call `noop` with a concise reason.
- If failures exist but are already fully tracked, prefer closing stale/fixed issues and avoid creating new issues.
- Only create a new parent report issue when P0 failures have no existing tracking coverage.
- Prefer closing stale/fixed issues over creating new issues when issue volume is high.
- Always be explicit about confidence and unknowns.
**Important**: If no action is needed after completing your analysis, you **MUST** call the `noop` safe-output tool with a brief explanation.
```json
{"noop": {"message": "No action needed: [brief explanation of what was analyzed and why]"}}
```