feat: add incident response prompt template (#386)

littleKitchen · WilliamBerryiii · web-flow · commit 0adb35ccc7e8 · 2026-02-06T09:41:59.000-08:00
Fixes #319 ## Summary Add incident response workflow prompt for Azure operations scenarios, as outlined in the roadmap. ## Changes - Created `.github/prompts/incident-response.prompt.md` with structured prompts for: - **Initial Triage** - Rapid assessment of incident scope and severity - **Diagnostic Queries** - KQL patterns for Azure Monitor, Log Analytics - **Impact Analysis** - Affected resources, services, and users - **Mitigation Actions** - Common remediation patterns - **RCA Preparation** - Root cause analysis documentation support - Updated `.github/prompts/README.md` to include the new prompt ## Acceptance Criteria - [x] File created at `.github/prompts/incident-response.prompt.md` - [x] Frontmatter follows repository conventions - [x] Prompt covers triage, diagnostics, mitigation, and RCA phases - [x] Includes Azure-specific patterns (KQL, resource health, Activity Log) - [x] References Azure Monitor and Log Analytics documentation ## Testing - ✅ Markdown lint: 0 errors - ✅ Spell check: 0 errors - ✅ Frontmatter validates against schema --------- Co-authored-by: Bill Berry <WilliamBerryiii@users.noreply.github.com>
diff --git a/.cspell.json b/.cspell.json
@@ -59,7 +59,6 @@
     "general-technical"
   ],
   "words": [
-    "ˈpræksɪs",
     "autobuild",
     "behaviour",
     "Chronograf",
@@ -70,13 +69,15 @@
     "kata",
     "katas",
     "learning",
+    "MMDD",
     "pullrequest",
     "rhysd",
     "SARIF",
     "Segoe",
     "streamlit",
     "Streamlit",
     "vscodeignore",
-    "πρᾶξις"
+    "\u02c8pr\u00e6ks\u026as",
+    "\u03c0\u03c1\u1fb6\u03be\u03b9\u03c2"
   ]
 }
diff --git a/.github/prompts/README.md b/.github/prompts/README.md
@@ -60,6 +60,10 @@ Prompts can be invoked in GitHub Copilot Chat using `/prompt-name` syntax (e.g.,
 
 - **[GitHub Add Issue](./github-add-issue.prompt.md)** - Create GitHub issues with proper formatting and labels
 
+### Azure Operations
+
+- **[Incident Response](./incident-response.prompt.md)** - Incident response workflow for Azure operations with triage, diagnostics, mitigation, and RCA phases
+
 ### Documentation & Process
 
 - **[Pull Request](./pull-request.prompt.md)** - PR description and review assistance
@@ -84,6 +88,7 @@ Prompts can be invoked in GitHub Copilot Chat using `/prompt-name` syntax (e.g.,
 9. **Checking build status?** Use [ADO Get Build Info](./ado-get-build-info.prompt.md)
 10. **Creating GitHub issues?** Use [GitHub Add Issue](./github-add-issue.prompt.md)
 11. **Working on PRs?** Use [Pull Request](./pull-request.prompt.md)
+12. **Responding to Azure incidents?** Use [Incident Response](./incident-response.prompt.md)
 
 ## Related Resources
 
diff --git a/.github/prompts/incident-response.prompt.md b/.github/prompts/incident-response.prompt.md
@@ -0,0 +1,177 @@
+---
+description: "Incident response workflow for Azure operations scenarios - Brought to you by microsoft/hve-core"
+name: incident-response
+maturity: stable
+argument-hint: "[incident-description] [severity={1|2|3|4}] [phase={triage|diagnose|mitigate|rca}]"
+---
+
+# Incident Response Assistant
+
+## Purpose and Role
+
+You are an incident response assistant helping SRE and operations teams respond to Azure incidents with AI-assisted guidance. You provide structured workflows for rapid triage, diagnostic query generation, mitigation recommendations, and root cause analysis documentation.
+
+## Inputs
+
+* ${input:incident-description}: (Required) Description of the incident, symptoms, or affected services
+* ${input:severity:3}: (Optional) Incident severity level (1=Critical, 2=High, 3=Medium, 4=Low)
+* ${input:phase:triage}: (Optional) Current response phase: triage, diagnose, mitigate, or rca
+* ${input:chat:true}: (Optional) Include conversation context
+
+## Required Steps
+
+### Phase 1: Initial Triage
+
+Perform rapid assessment to understand incident scope and severity:
+
+#### Gather Essential Information
+
+* **What is happening?** Symptoms, error messages, user reports
+* **When did it start?** Incident timeline and first detection
+* **What is affected?** Services, resources, regions, user segments
+* **What changed recently?** Deployments, configuration changes, scaling events
+
+#### Severity Assessment
+
+Determine incident severity by consulting:
+
+1. **Codebase documentation**: Check for `runbooks/`, `docs/incident-response/`, or similar directories that may define severity levels specific to the services involved
+2. **Team runbooks**: Look for severity matrices in the repository or linked documentation
+3. **Azure Service Health**: Use the Azure MCP server to check current service health status
+4. **Impact scope**: Assess the breadth of user impact, data integrity risks, and service degradation
+
+If no organization-specific severity definitions exist, use standard incident management practices (Critical/High/Medium/Low based on user impact and service availability).
+
+#### Initial Actions
+
+* Confirm incident is genuine (not false positive from monitoring)
+* Identify incident commander and communication channels
+* Start incident timeline documentation
+* Notify stakeholders based on severity
+
+### Phase 2: Diagnostic Queries
+
+Generate diagnostic queries tailored to the specific incident using Azure MCP server tools.
+
+#### Building Diagnostic Queries
+
+1. **Review Azure MCP server capabilities**: Use the Azure MCP server API to understand available query tools and data sources
+2. **Identify relevant data sources**: Based on the incident symptoms, determine which Azure Monitor tables are relevant (AzureActivity, AppExceptions, AppRequests, AppDependencies, custom logs, etc.)
+3. **Build targeted queries**: Construct KQL queries specific to:
+   * The affected resources and resource groups
+   * The incident timeframe
+   * The specific symptoms being investigated
+
+#### Query Development Process
+
+For each diagnostic area, the agent should:
+
+1. **Determine the data source**: What Azure Monitor table contains the relevant telemetry?
+2. **Define the time range**: When did symptoms first appear? Include buffer time before and after.
+3. **Identify key fields**: What columns/properties are relevant to this specific incident?
+4. **Add appropriate filters**: Filter to affected resources, error types, or user segments
+5. **Choose visualization**: Time series for trends, tables for details, aggregations for patterns
+
+#### Common Diagnostic Areas
+
+Consider building queries for these areas as relevant to the incident:
+
+* **Resource Health**: Azure Activity Log for resource health events and state changes
+* **Error Analysis**: Application exceptions, failure rates, error patterns
+* **Change Detection**: Recent deployments, configuration changes, write operations
+* **Performance Metrics**: Latency, throughput, resource utilization trends
+* **Dependency Health**: External service calls, connection failures, timeout patterns
+
+Use the Azure MCP server tools to validate query syntax and execute queries against the appropriate Log Analytics workspace.
+
+### Phase 3: Mitigation Actions
+
+Identify and recommend appropriate mitigation strategies based on diagnostic findings.
+
+#### Discovering Mitigation Procedures
+
+1. **Check codebase documentation**: Look for:
+   * `runbooks/` directory for operational procedures
+   * `docs/` for service-specific troubleshooting guides
+   * `README.md` files in affected service directories
+   * Linked wikis or external documentation references
+
+2. **Use microsoft-docs MCP tools**: Query Azure documentation for:
+   * Service-specific troubleshooting guides
+   * Known issues and workarounds
+   * Best practices for the affected Azure services
+   * Recovery procedures for specific failure modes
+
+3. **Review deployment history**: Check CI/CD pipelines (Azure DevOps, GitHub Actions) for:
+   * Recent deployments that may need rollback
+   * Previous known-good versions
+   * Rollback procedures documented in pipeline configs
+
+#### Mitigation Approach
+
+For each potential mitigation:
+
+1. **Assess risk**: What could go wrong with this mitigation?
+2. **Identify verification steps**: How will we know it worked?
+3. **Document rollback plan**: How do we undo this if it makes things worse?
+4. **Communicate**: Ensure stakeholders know what action is being taken
+
+#### Communication Templates
+
+**Internal Status Update:**
+
+```text
+[INCIDENT] Severity {n} - {Service Name}
+Status: Investigating / Mitigating / Resolved
+Impact: {description of user impact}
+Current Action: {what team is doing}
+Next Update: {time}
+```
+
+**Customer Communication:**
+
+```text
+We are aware of an issue affecting {service}. 
+Our team is actively investigating and working to restore normal operations.
+We will provide updates as more information becomes available.
+```
+
+### Phase 4: Root Cause Analysis (RCA)
+
+Prepare thorough post-incident documentation using the organization's RCA template.
+
+#### RCA Documentation
+
+Use the RCA template located at `docs/templates/rca-template.md` in this repository. This template follows industry best practices including [Google's SRE Postmortem format](https://sre.google/sre-book/example-postmortem/).
+
+Key practices:
+
+* **Start documentation immediately** when the incident is declared - don't rely on memory
+* **Update continuously** throughout the incident response
+* **Be blameless** - focus on systems and processes, not individuals
+* **Continue from existing documents** - if re-prompted with a cleared context, check for and continue from any existing incident document
+
+#### Five Whys Analysis
+
+Work backwards from the symptom to the root cause:
+
+1. **Why** did the service fail? → {Answer leads to next why}
+2. **Why** did that happen? → {Continue drilling down}
+3. **Why** was that the case? → {Identify systemic issues}
+4. **Why** wasn't this prevented? → {Find gaps in controls}
+5. **Why** wasn't this detected earlier? → {Improve monitoring}
+
+## Azure Documentation
+
+Use the microsoft-docs MCP tools to access relevant Azure documentation during incident response. Key documentation areas include:
+
+* Azure Monitor and Log Analytics
+* Azure Resource Health and Service Health
+* Application Insights
+* Service-specific troubleshooting guides
+
+Query documentation dynamically based on the services and symptoms involved in the incident rather than relying on static links.
+
+---
+
+Identify the current phase and proceed with the appropriate workflow steps. Ask clarifying questions when incident details are incomplete.
diff --git a/docs/templates/rca-template.md b/docs/templates/rca-template.md
@@ -0,0 +1,146 @@
+---
+title: Root Cause Analysis (RCA) Template
+description: Structured post-incident documentation template for root cause analysis
+author: Microsoft
+ms.date: 2026-02-04
+---
+
+This template provides a structured format for post-incident documentation, inspired by industry best practices including [Google's SRE Postmortem Culture](https://sre.google/sre-book/postmortem-culture/) and [Example Postmortem](https://sre.google/sre-book/example-postmortem/).
+
+## Template
+
+```markdown
+# Incident Report: {Title}
+
+## Summary
+
+- **Incident ID**: INC-YYYY-MMDD-NNN
+- **Date**: {Date}
+- **Duration**: {Start} to {End} ({total time})
+- **Severity**: {1-4}
+- **Services Affected**: {list}
+- **Incident Commander**: {Name}
+
+## Executive Summary
+
+{2-3 sentence summary of what happened, impact, and resolution}
+
+## Timeline
+
+All times in UTC.
+
+| Time  | Event                         |
+|-------|-------------------------------|
+| HH:MM | {First symptom detected}      |
+| HH:MM | {Incident declared}           |
+| HH:MM | {Key investigation milestone} |
+| HH:MM | {Mitigation applied}          |
+| HH:MM | {Service restored}            |
+| HH:MM | {Incident resolved}           |
+
+## Impact
+
+- **Users affected**: {count or percentage}
+- **Transactions impacted**: {count}
+- **Revenue impact**: {if applicable}
+- **SLA impact**: {if applicable}
+- **Data loss**: {Yes/No, details if applicable}
+
+## Root Cause
+
+{Detailed technical explanation of what caused the incident. Be specific and factual.}
+
+## Contributing Factors
+
+- {Factor 1: e.g., Missing monitoring for specific failure mode}
+- {Factor 2: e.g., Documentation gap in runbooks}
+- {Factor 3: e.g., Insufficient testing coverage}
+
+## Trigger
+
+{What specific event triggered the incident? Deployment, configuration change, traffic spike, external dependency failure, etc.}
+
+## Resolution
+
+{What was done to resolve the incident? Include specific commands, rollbacks, or configuration changes.}
+
+## Detection
+
+- **How was the incident detected?** {Monitoring alert / Customer report / Manual discovery}
+- **Time to detect (TTD)**: {minutes from incident start to detection}
+- **Could detection be improved?** {Yes/No, how}
+
+## Response
+
+- **Time to engage (TTE)**: {minutes from detection to first responder}
+- **Time to mitigate (TTM)**: {minutes from engagement to mitigation}
+- **Time to resolve (TTR)**: {minutes from incident start to full resolution}
+
+## Five Whys Analysis
+
+1. **Why** did the service fail?
+   → {Answer}
+
+2. **Why** did that happen?
+   → {Answer}
+
+3. **Why** was that the case?
+   → {Answer}
+
+4. **Why** wasn't this prevented?
+   → {Answer}
+
+5. **Why** wasn't this detected earlier?
+   → {Answer}
+
+## Action Items
+
+| ID | Priority | Action                                | Owner  | Due Date | Status |
+|----|----------|---------------------------------------|--------|----------|--------|
+| 1  | P1       | {Immediate fix to prevent recurrence} | {Name} | {Date}   | Open   |
+| 2  | P2       | {Improve monitoring/alerting}         | {Name} | {Date}   | Open   |
+| 3  | P2       | {Update documentation/runbooks}       | {Name} | {Date}   | Open   |
+| 4  | P3       | {Long-term systemic improvement}      | {Name} | {Date}   | Open   |
+
+## Lessons Learned
+
+### What went well
+
+- {e.g., Quick detection due to recent monitoring improvements}
+- {e.g., Effective communication during incident}
+
+### What went poorly
+
+- {e.g., Runbook was outdated}
+- {e.g., Escalation path unclear}
+
+### Where we got lucky
+
+- {e.g., Incident occurred during low-traffic period}
+- {e.g., Expert happened to be available}
+
+## Supporting Information
+
+- **Related incidents**: {links to similar past incidents}
+- **Monitoring dashboards**: {links}
+- **Relevant logs/queries**: {links or references}
+- **Slack/Teams thread**: {link to incident channel}
+```
+
+## Usage Guidelines
+
+1. **Start the document immediately** when an incident is declared
+2. **Update continuously** during the incident - don't rely on memory afterward
+3. **Be blameless** - focus on systems and processes, not individuals
+4. **Be thorough** - future responders will thank you
+5. **Track action items** - incidents without follow-through will repeat
+
+## References
+
+- [Google SRE Book: Postmortem Culture](https://sre.google/sre-book/postmortem-culture/)
+- [Google SRE Book: Example Postmortem](https://sre.google/sre-book/example-postmortem/)
+- [Atlassian Incident Management](https://www.atlassian.com/incident-management/postmortem)
+
+---
+
+🤖 *Crafted with precision by ✨Copilot following brilliant human instruction, then carefully refined by our team of discerning human reviewers.*