[aw] Failure Investigator (6h) #55

Agentic Workflow file for this run

.github/workflows/aw-failure-investigator.md at a5a0353

	---
	description: Investigates [aw] failures from the last 6 hours, correlates with open agentic-workflows issues, closes fixed issues, and opens focused fix sub-issues when needed
	on:
	  schedule:
	    - cron: "every 6h"
	  workflow_dispatch:
	permissions:
	  contents: read
	  actions: read
	  issues: read
	  pull-requests: read
	tracker-id: aw-failure-investigator
	engine: claude
	tools:
	  cli-proxy: true
	  agentic-workflows:
	  github:
	    toolsets: [default, actions]
	  bash: ["*"]
	safe-outputs:
	  create-issue:
	    expires: 7d
	    title-prefix: "[aw-failures] "
	    labels: [agentic-workflows, automation, cookie]
	    max: 2
	    group: true
	  update-issue:
	    target: "*"
	    max: 10
	  link-sub-issue:
	    max: 10
	  noop:
	timeout-minutes: 60
	imports:
	  - shared/reporting.md
	---

	# [aw] Failure Investigator (6h)

	Investigate agentic workflow failures from the last 6 hours and produce actionable issue tracking with sub-issues.

	## Scope

	- Repository: `${{ github.repository }}`
	- Lookback window: last 6 hours
	- Issue query to inspect first: <https://github.com/github/gh-aw/issues?q=is%3Aissue%20state%3Aopen%20label%3Aagentic-workflows>

	## Mission

	1. Find recent failures from agentic workflows in the last 6 hours.
	2. Correlate findings with currently open `agentic-workflows` issues.
	3. Perform large-scale failure analysis using logs + audit + audit-diff.
	4. Close fixed/stale issues first, then create only the minimum necessary linked fix sub-issues.

	## Required Investigation Steps

	### 1) Fetch and review existing issue context

	Find open issues with `agentic-workflows` label using GitHub issues search (equivalent to the URL query above). Focus on issues created/updated in the lookback window and unresolved recurring themes.

	Capture:
	- Existing failure clusters already tracked
	- Gaps where recurring failures are not yet tracked
	- Potential duplicates to avoid

	### 2) Collect workflow runs and isolate failures (last 6h)

	Use `agentic-workflows` MCP `logs` with a 6-hour window (for example `start_date: "-6h"`) and enough count to cover all recent runs.

	Build a failure dataset including:
	- run id, workflow, engine, status/conclusion
	- timestamps and durations
	- repeated failure signatures
	- affected tools / MCP / firewall / auth / timeout dimensions

	### 3) Deep-dive each failure cluster with `audit`

	For each meaningful cluster (not every single run if many are equivalent), call `agentic-workflows` MCP `audit` on representative failed runs and at least one successful comparator run when available.

	Extract evidence:
	- root-cause signals
	- dominant error messages
	- tool failure patterns
	- token/cost/runtime anomalies
	- infra vs workflow-definition vs prompt/tooling failure classification

	### 4) Compare behavior with `audit-diff`

	Use `agentic-workflows` MCP `audit-diff` to compare:
	- failed run vs nearest successful run of the same workflow, or
	- failed run vs prior failed run to detect drift

	Identify regressions and deltas (metrics/tooling/firewall/MCP behavior) that support fix recommendations.

	### 5) Close fixed issues first, then add focused sub-issues

	First, identify currently open `agentic-workflows` issues that are now fixed, stale, or no longer actionable based on fresh evidence, and close them using `update-issue`.

	Then, if new uncovered work remains, add sub-issues for concrete fixes to the most recent open parent report issue instead of creating a new parent by default.

	Only create a new parent report issue (temporary ID format `aw_` + 3-8 alphanumeric characters) when P0 failures have no existing tracking coverage.

	Each new sub-issue must include:
	- clear problem statement
	- affected workflows and run IDs
	- probable root cause
	- specific proposed remediation
	- success criteria / verification

	## Output Requirements

	### Parent report issue structure

	Include these sections:
	1. Executive summary
	2. Failure clusters (table)
	3. Evidence (logs/audit/audit-diff)
	4. Existing issue correlation
	5. Proposed fix roadmap (P0/P1/P2)
	6. Sub-issues created

	### Sub-issue quality bar

	- Prefer a few high-quality, actionable sub-issues over many weak ones.
	- Avoid duplicates of already-open issues unless new evidence materially changes scope.
	- Reference the parent issue and the concrete run IDs analyzed.

	## Decision Rules

	- If there are no failures in the last 6h, or no actionable delta vs existing issues, call `noop` with a concise reason.
	- If failures exist but are already fully tracked, prefer closing stale/fixed issues and avoid creating new issues.
	- Only create a new parent report issue when P0 failures have no existing tracking coverage.
	- Prefer closing stale/fixed issues over creating new issues when issue volume is high.
	- Always be explicit about confidence and unknowns.

	Important: If no action is needed after completing your analysis, you MUST call the `noop` safe-output tool with a brief explanation.

	```json
	{"noop": {"message": "No action needed: [brief explanation of what was analyzed and why]"}}
	```
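
The Decision Rules in the workflow file above can be read as a small decision procedure. The following Python sketch is one possible interpretation, not part of the workflow itself: the `Findings` fields, the action names, and the helper functions are all hypothetical names introduced for illustration. It assumes that "no actionable delta" means there are no untracked failure clusters and no fixed or stale issues to close. The source does not say what to do when untracked non-P0 work exists but no parent report is open, so the sketch leaves that case alone.

```python
import json
import re
from dataclasses import dataclass

# Temporary ID format taken from the prompt: "aw_" + 3-8 alphanumeric characters.
TEMP_ID_RE = re.compile(r"aw_[A-Za-z0-9]{3,8}")


@dataclass
class Findings:
    """Hypothetical summary of one investigation pass (names are assumptions)."""
    failures: int        # failed runs found in the last 6 hours
    untracked: int       # failure clusters with no open tracking issue
    untracked_p0: int    # untracked clusters classified as P0
    fixed_or_stale: int  # open issues that now look fixed or stale


def plan_actions(f: Findings) -> list[str]:
    """Map findings to an ordered list of actions, following the Decision Rules."""
    # No failures, or nothing new to track and nothing to close: noop.
    if f.failures == 0 or (f.untracked == 0 and f.fixed_or_stale == 0):
        return ["noop"]
    actions = []
    # Closing fixed/stale issues always comes first.
    if f.fixed_or_stale > 0:
        actions.append("close_fixed_or_stale")
    if f.untracked > 0:
        # A new parent report is only justified for P0 failures without coverage.
        if f.untracked_p0 > 0:
            actions.append("create_parent_report")
        actions.append("add_sub_issues")
    return actions


def noop_payload(reason: str) -> str:
    """Build the noop JSON in the shape shown in the workflow file."""
    return json.dumps({"noop": {"message": f"No action needed: {reason}"}})


def is_valid_temp_id(value: str) -> bool:
    """Check a temporary parent issue ID against the stated format."""
    return TEMP_ID_RE.fullmatch(value) is not None
```

For example, `plan_actions(Findings(failures=4, untracked=0, untracked_p0=0, fixed_or_stale=2))` yields only the close step, matching the rule that fully tracked failures should not produce new issues.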
