
Commit 0adb35c

feat: add incident response prompt template (#386)
Fixes #319

## Summary

Add incident response workflow prompt for Azure operations scenarios, as outlined in the roadmap.

## Changes

- Created `.github/prompts/incident-response.prompt.md` with structured prompts for:
  - **Initial Triage** - Rapid assessment of incident scope and severity
  - **Diagnostic Queries** - KQL patterns for Azure Monitor, Log Analytics
  - **Impact Analysis** - Affected resources, services, and users
  - **Mitigation Actions** - Common remediation patterns
  - **RCA Preparation** - Root cause analysis documentation support
- Updated `.github/prompts/README.md` to include the new prompt

## Acceptance Criteria

- [x] File created at `.github/prompts/incident-response.prompt.md`
- [x] Frontmatter follows repository conventions
- [x] Prompt covers triage, diagnostics, mitigation, and RCA phases
- [x] Includes Azure-specific patterns (KQL, resource health, Activity Log)
- [x] References Azure Monitor and Log Analytics documentation

## Testing

- ✅ Markdown lint: 0 errors
- ✅ Spell check: 0 errors
- ✅ Frontmatter validates against schema

---------

Co-authored-by: Bill Berry <[email protected]>
1 parent 25b34de commit 0adb35c

4 files changed

Lines changed: 331 additions & 2 deletions

File tree

- .cspell.json
- .github/prompts/README.md
- .github/prompts/incident-response.prompt.md
- docs/templates/rca-template.md

.cspell.json

Lines changed: 3 additions & 2 deletions

```diff
@@ -59,7 +59,6 @@
     "general-technical"
   ],
   "words": [
-    "ˈpræksɪs",
     "autobuild",
     "behaviour",
     "Chronograf",
@@ -70,13 +69,15 @@
     "kata",
     "katas",
     "learning",
+    "MMDD",
     "pullrequest",
     "rhysd",
     "SARIF",
     "Segoe",
     "streamlit",
     "Streamlit",
     "vscodeignore",
-    "πρᾶξις"
+    "\u02c8pr\u00e6ks\u026as",
+    "\u03c0\u03c1\u1fb6\u03be\u03b9\u03c2"
   ]
 }
```

.github/prompts/README.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -60,6 +60,10 @@ Prompts can be invoked in GitHub Copilot Chat using `/prompt-name` syntax (e.g.,
 
 - **[GitHub Add Issue](./github-add-issue.prompt.md)** - Create GitHub issues with proper formatting and labels
 
+### Azure Operations
+
+- **[Incident Response](./incident-response.prompt.md)** - Incident response workflow for Azure operations with triage, diagnostics, mitigation, and RCA phases
+
 ### Documentation & Process
 
 - **[Pull Request](./pull-request.prompt.md)** - PR description and review assistance
@@ -84,6 +88,7 @@ Prompts can be invoked in GitHub Copilot Chat using `/prompt-name` syntax (e.g.,
 9. **Checking build status?** Use [ADO Get Build Info](./ado-get-build-info.prompt.md)
 10. **Creating GitHub issues?** Use [GitHub Add Issue](./github-add-issue.prompt.md)
 11. **Working on PRs?** Use [Pull Request](./pull-request.prompt.md)
+12. **Responding to Azure incidents?** Use [Incident Response](./incident-response.prompt.md)
 
 ## Related Resources
 
```
.github/prompts/incident-response.prompt.md

Lines changed: 177 additions & 0 deletions

---
description: "Incident response workflow for Azure operations scenarios - Brought to you by microsoft/hve-core"
name: incident-response
maturity: stable
argument-hint: "[incident-description] [severity={1|2|3|4}] [phase={triage|diagnose|mitigate|rca}]"
---
# Incident Response Assistant

## Purpose and Role

You are an incident response assistant helping SRE and operations teams respond to Azure incidents with AI-assisted guidance. You provide structured workflows for rapid triage, diagnostic query generation, mitigation recommendations, and root cause analysis documentation.

## Inputs

* ${input:incident-description}: (Required) Description of the incident, symptoms, or affected services
* ${input:severity:3}: (Optional) Incident severity level (1=Critical, 2=High, 3=Medium, 4=Low)
* ${input:phase:triage}: (Optional) Current response phase: triage, diagnose, mitigate, or rca
* ${input:chat:true}: (Optional) Include conversation context

## Required Steps

### Phase 1: Initial Triage

Perform rapid assessment to understand incident scope and severity:

#### Gather Essential Information

* **What is happening?** Symptoms, error messages, user reports
* **When did it start?** Incident timeline and first detection
* **What is affected?** Services, resources, regions, user segments
* **What changed recently?** Deployments, configuration changes, scaling events

#### Severity Assessment

Determine incident severity by consulting:

1. **Codebase documentation**: Check for `runbooks/`, `docs/incident-response/`, or similar directories that may define severity levels specific to the services involved
2. **Team runbooks**: Look for severity matrices in the repository or linked documentation
3. **Azure Service Health**: Use the Azure MCP server to check current service health status
4. **Impact scope**: Assess the breadth of user impact, data integrity risks, and service degradation

If no organization-specific severity definitions exist, use standard incident management practices (Critical/High/Medium/Low based on user impact and service availability).
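The Service Health check in step 3 can be complemented with a direct Activity Log query. A minimal sketch, assuming Activity Log diagnostic settings already route into the Log Analytics workspace (column names follow the standard `AzureActivity` schema; the 4-hour window is an arbitrary starting point to adjust per incident):

```kusto
// Recent Service Health / Resource Health events from the Activity Log
AzureActivity
| where TimeGenerated > ago(4h)
| where CategoryValue in ("ServiceHealth", "ResourceHealth")
| project TimeGenerated, CategoryValue, OperationNameValue, ActivityStatusValue, ResourceGroup
| order by TimeGenerated desc
```

An empty result here helps narrow triage toward application-level causes rather than platform incidents.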
#### Initial Actions

* Confirm incident is genuine (not false positive from monitoring)
* Identify incident commander and communication channels
* Start incident timeline documentation
* Notify stakeholders based on severity

### Phase 2: Diagnostic Queries

Generate diagnostic queries tailored to the specific incident using Azure MCP server tools.

#### Building Diagnostic Queries

1. **Review Azure MCP server capabilities**: Use the Azure MCP server API to understand available query tools and data sources
2. **Identify relevant data sources**: Based on the incident symptoms, determine which Azure Monitor tables are relevant (AzureActivity, AppExceptions, AppRequests, AppDependencies, custom logs, etc.)
3. **Build targeted queries**: Construct KQL queries specific to:
   * The affected resources and resource groups
   * The incident timeframe
   * The specific symptoms being investigated

#### Query Development Process

For each diagnostic area, the agent should:

1. **Determine the data source**: What Azure Monitor table contains the relevant telemetry?
2. **Define the time range**: When did symptoms first appear? Include buffer time before and after.
3. **Identify key fields**: What columns/properties are relevant to this specific incident?
4. **Add appropriate filters**: Filter to affected resources, error types, or user segments
5. **Choose visualization**: Time series for trends, tables for details, aggregations for patterns

#### Common Diagnostic Areas

Consider building queries for these areas as relevant to the incident:

* **Resource Health**: Azure Activity Log for resource health events and state changes
* **Error Analysis**: Application exceptions, failure rates, error patterns
* **Change Detection**: Recent deployments, configuration changes, write operations
* **Performance Metrics**: Latency, throughput, resource utilization trends
* **Dependency Health**: External service calls, connection failures, timeout patterns

Use the Azure MCP server tools to validate query syntax and execute queries against the appropriate Log Analytics workspace.
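As a concrete starting point for the Error Analysis area, a sketch against a workspace-based Application Insights resource (the `AppExceptions` table); the time window and the `take 20` limit are placeholders to adapt per incident:

```kusto
// Group exceptions in the incident window by problem type and role
AppExceptions
| where TimeGenerated between (datetime(2026-02-04T10:00:00Z) .. datetime(2026-02-04T12:00:00Z))
| summarize Count = count(), SampleMessage = take_any(OuterMessage) by ProblemId, AppRoleName
| order by Count desc
| take 20
```

Ranking by count surfaces the dominant failure mode first, which usually shortens the diagnose phase.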
### Phase 3: Mitigation Actions

Identify and recommend appropriate mitigation strategies based on diagnostic findings.

#### Discovering Mitigation Procedures

1. **Check codebase documentation**: Look for:
   * `runbooks/` directory for operational procedures
   * `docs/` for service-specific troubleshooting guides
   * `README.md` files in affected service directories
   * Linked wikis or external documentation references

2. **Use microsoft-docs MCP tools**: Query Azure documentation for:
   * Service-specific troubleshooting guides
   * Known issues and workarounds
   * Best practices for the affected Azure services
   * Recovery procedures for specific failure modes

3. **Review deployment history**: Check CI/CD pipelines (Azure DevOps, GitHub Actions) for:
   * Recent deployments that may need rollback
   * Previous known-good versions
   * Rollback procedures documented in pipeline configs
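Deployment history from step 3 can also be cross-checked from the telemetry side. A hedged sketch over the Activity Log - the 24-hour lookback and the status filter are assumptions to tune per incident:

```kusto
// Successful write/delete operations preceding the incident - candidate rollback targets
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue endswith "write" or OperationNameValue endswith "delete"
| where ActivityStatusValue == "Succeeded"
| project TimeGenerated, OperationNameValue, Caller, ResourceGroup, _ResourceId
| order by TimeGenerated desc
```

Correlating these timestamps with incident onset often points directly at the change that needs rolling back.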
#### Mitigation Approach

For each potential mitigation:

1. **Assess risk**: What could go wrong with this mitigation?
2. **Identify verification steps**: How will we know it worked?
3. **Document rollback plan**: How do we undo this if it makes things worse?
4. **Communicate**: Ensure stakeholders know what action is being taken

#### Communication Templates

**Internal Status Update:**

```text
[INCIDENT] Severity {n} - {Service Name}
Status: Investigating / Mitigating / Resolved
Impact: {description of user impact}
Current Action: {what team is doing}
Next Update: {time}
```

**Customer Communication:**

```text
We are aware of an issue affecting {service}.
Our team is actively investigating and working to restore normal operations.
We will provide updates as more information becomes available.
```
### Phase 4: Root Cause Analysis (RCA)

Prepare thorough post-incident documentation using the organization's RCA template.

#### RCA Documentation

Use the RCA template located at `docs/templates/rca-template.md` in this repository. This template follows industry best practices including [Google's SRE Postmortem format](https://sre.google/sre-book/example-postmortem/).

Key practices:

* **Start documentation immediately** when the incident is declared - don't rely on memory
* **Update continuously** throughout the incident response
* **Be blameless** - focus on systems and processes, not individuals
* **Continue from existing documents** - if re-prompted with a cleared context, check for and continue from any existing incident document

#### Five Whys Analysis

Work backwards from the symptom to the root cause:

1. **Why** did the service fail? → {Answer leads to next why}
2. **Why** did that happen? → {Continue drilling down}
3. **Why** was that the case? → {Identify systemic issues}
4. **Why** wasn't this prevented? → {Find gaps in controls}
5. **Why** wasn't this detected earlier? → {Improve monitoring}

## Azure Documentation

Use the microsoft-docs MCP tools to access relevant Azure documentation during incident response. Key documentation areas include:

* Azure Monitor and Log Analytics
* Azure Resource Health and Service Health
* Application Insights
* Service-specific troubleshooting guides

Query documentation dynamically based on the services and symptoms involved in the incident rather than relying on static links.

---

Identify the current phase and proceed with the appropriate workflow steps. Ask clarifying questions when incident details are incomplete.

docs/templates/rca-template.md

Lines changed: 146 additions & 0 deletions

---
title: Root Cause Analysis (RCA) Template
description: Structured post-incident documentation template for root cause analysis
author: Microsoft
ms.date: 2026-02-04
---

This template provides a structured format for post-incident documentation, inspired by industry best practices including [Google's SRE Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) and [Example Postmortem](https://sre.google/sre-book/example-postmortem/).

## Template

```markdown
# Incident Report: {Title}

## Summary

- **Incident ID**: INC-YYYY-MMDD-NNN
- **Date**: {Date}
- **Duration**: {Start} to {End} ({total time})
- **Severity**: {1-4}
- **Services Affected**: {list}
- **Incident Commander**: {Name}

## Executive Summary

{2-3 sentence summary of what happened, impact, and resolution}

## Timeline

All times in UTC.

| Time  | Event                         |
|-------|-------------------------------|
| HH:MM | {First symptom detected}      |
| HH:MM | {Incident declared}           |
| HH:MM | {Key investigation milestone} |
| HH:MM | {Mitigation applied}          |
| HH:MM | {Service restored}            |
| HH:MM | {Incident resolved}           |

## Impact

- **Users affected**: {count or percentage}
- **Transactions impacted**: {count}
- **Revenue impact**: {if applicable}
- **SLA impact**: {if applicable}
- **Data loss**: {Yes/No, details if applicable}

## Root Cause

{Detailed technical explanation of what caused the incident. Be specific and factual.}

## Contributing Factors

- {Factor 1: e.g., Missing monitoring for specific failure mode}
- {Factor 2: e.g., Documentation gap in runbooks}
- {Factor 3: e.g., Insufficient testing coverage}

## Trigger

{What specific event triggered the incident? Deployment, configuration change, traffic spike, external dependency failure, etc.}

## Resolution

{What was done to resolve the incident? Include specific commands, rollbacks, or configuration changes.}

## Detection

- **How was the incident detected?** {Monitoring alert / Customer report / Manual discovery}
- **Time to detect (TTD)**: {minutes from incident start to detection}
- **Could detection be improved?** {Yes/No, how}

## Response

- **Time to engage (TTE)**: {minutes from detection to first responder}
- **Time to mitigate (TTM)**: {minutes from engagement to mitigation}
- **Time to resolve (TTR)**: {minutes from incident start to full resolution}

## Five Whys Analysis

1. **Why** did the service fail?
   → {Answer}

2. **Why** did that happen?
   → {Answer}

3. **Why** was that the case?
   → {Answer}

4. **Why** wasn't this prevented?
   → {Answer}

5. **Why** wasn't this detected earlier?
   → {Answer}

## Action Items

| ID | Priority | Action                                | Owner  | Due Date | Status |
|----|----------|---------------------------------------|--------|----------|--------|
| 1  | P1       | {Immediate fix to prevent recurrence} | {Name} | {Date}   | Open   |
| 2  | P2       | {Improve monitoring/alerting}         | {Name} | {Date}   | Open   |
| 3  | P2       | {Update documentation/runbooks}       | {Name} | {Date}   | Open   |
| 4  | P3       | {Long-term systemic improvement}      | {Name} | {Date}   | Open   |

## Lessons Learned

### What went well

- {e.g., Quick detection due to recent monitoring improvements}
- {e.g., Effective communication during incident}

### What went poorly

- {e.g., Runbook was outdated}
- {e.g., Escalation path unclear}

### Where we got lucky

- {e.g., Incident occurred during low-traffic period}
- {e.g., Expert happened to be available}

## Supporting Information

- **Related incidents**: {links to similar past incidents}
- **Monitoring dashboards**: {links}
- **Relevant logs/queries**: {links or references}
- **Slack/Teams thread**: {link to incident channel}
```
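The detection and response metrics in the template (TTD, TTE, TTM, TTR) fall out mechanically once the timeline table is filled in. An illustrative KQL sketch - the `datatable` literal is fabricated sample data, not part of the template:

```kusto
// Compute time-to-X metrics from a hand-entered incident timeline (sample data)
let Timeline = datatable(Event: string, Timestamp: datetime) [
    "IncidentStart", datetime(2026-02-04T10:00:00Z),
    "Detected",      datetime(2026-02-04T10:12:00Z),
    "Engaged",       datetime(2026-02-04T10:20:00Z),
    "Mitigated",     datetime(2026-02-04T11:05:00Z),
    "Resolved",      datetime(2026-02-04T11:40:00Z)
];
Timeline
| summarize Start    = take_anyif(Timestamp, Event == "IncidentStart"),
            Detect   = take_anyif(Timestamp, Event == "Detected"),
            Engage   = take_anyif(Timestamp, Event == "Engaged"),
            Mitigate = take_anyif(Timestamp, Event == "Mitigated"),
            Resolve  = take_anyif(Timestamp, Event == "Resolved")
| project TTD = Detect - Start, TTE = Engage - Detect, TTM = Mitigate - Engage, TTR = Resolve - Start
```

For this sample timeline the metrics work out to TTD 12 min, TTE 8 min, TTM 45 min, and TTR 1 h 40 min.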
## Usage Guidelines

1. **Start the document immediately** when an incident is declared
2. **Update continuously** during the incident - don't rely on memory afterward
3. **Be blameless** - focus on systems and processes, not individuals
4. **Be thorough** - future responders will thank you
5. **Track action items** - incidents without follow-through will repeat

## References

- [Google SRE Book: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
- [Google SRE Book: Example Postmortem](https://sre.google/sre-book/example-postmortem/)
- [Atlassian Incident Management](https://www.atlassian.com/incident-management/postmortem)

---

🤖 *Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.*
