Incident Response
internal
Severity model, triage, mitigation, and communication flow.
Severity
- SEV-1: broad outage or critical data/security risk
- SEV-2: major feature unavailable or heavily degraded
- SEV-3: partial impact with workaround
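The three levels above can be sketched as a small classifier, which may help keep paging rules and dashboards consistent with this page. This is a minimal illustration, not the team's actual tooling: the `Impact` fields and the `classify` function are hypothetical names, and treating SEV-3 as the default when nothing more severe matches is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


@dataclass
class Impact:
    # Hypothetical inputs chosen for illustration; adapt to real signals.
    broad_outage: bool = False
    data_or_security_risk: bool = False
    major_feature_down_or_degraded: bool = False


def classify(impact: Impact) -> Severity:
    """Return the highest severity whose condition matches."""
    if impact.broad_outage or impact.data_or_security_risk:
        return Severity.SEV1
    if impact.major_feature_down_or_degraded:
        return Severity.SEV2
    # Assumption: anything less severe is treated as partial impact (SEV-3).
    return Severity.SEV3
```

Checking the most severe condition first means an incident that matches several descriptions is always assigned the higher level.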
First 15 minutes
- Confirm impact scope and affected environments.
- Check recent deploys/config changes.
- Validate health endpoints and dependencies.
- Decide rollback vs runtime mitigation.
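The last step of this checklist, choosing between rollback and runtime mitigation, can be sketched as a simple heuristic. Everything here is an assumption made for illustration: the `TriageInput` fields, the one-hour correlation window, and the `"escalate"` fallback are not defined by this document, and a real decision also depends on judgment that code cannot capture.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TriageInput:
    # Hypothetical fields gathered during the first checklist steps.
    impact_started_at: datetime
    last_change_at: Optional[datetime]  # most recent deploy/config change, if any
    flag_covers_failing_path: bool      # a runtime flag can disable the bad path


def choose_mitigation(t: TriageInput, window: timedelta = timedelta(hours=1)) -> str:
    """Suggest 'rollback', 'runtime-flag', or 'escalate' (assumed labels)."""
    # A change that landed shortly before impact began points to rollback.
    if t.last_change_at is not None:
        gap = t.impact_started_at - t.last_change_at
        if timedelta(0) <= gap <= window:
            return "rollback"
    # Otherwise prefer disabling the failing path at runtime, if possible.
    if t.flag_covers_failing_path:
        return "runtime-flag"
    return "escalate"
```

The window is a parameter because how tightly a change correlates with the start of impact varies by system; the default is only a placeholder.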
Mitigation options
- rollback latest deployment
- disable risky paths with runtime flags
- pause non-critical async processing
- switch provider path when a single provider is degraded
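As one illustration of the provider-switch option, the sketch below picks the first provider in preference order that is not marked degraded. The function name `pick_provider`, the provider names, and the shape of the `degraded` map are hypothetical; how degradation is actually detected is outside the scope of this sketch.

```python
from typing import Mapping, Optional, Sequence


def pick_provider(preferred: Sequence[str],
                  degraded: Mapping[str, bool]) -> Optional[str]:
    """Return the first provider in preference order that is not degraded.

    Returns None when every provider is degraded, in which case switching
    paths does not help and another mitigation is needed.
    """
    for name in preferred:
        # Providers missing from the map are assumed healthy.
        if not degraded.get(name, False):
            return name
    return None


# Usage with hypothetical provider names:
# pick_provider(["primary", "secondary"], {"primary": True}) returns "secondary"
```

Returning `None` rather than raising keeps the "all providers degraded" case explicit, so callers fall back to a different option from the list above.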
Closeout
- verify core journeys are healthy
- publish postmortem with corrective actions
- update runbooks/alerts/tests
Detailed operations copy: docs/runbooks/incident-response.md