Overview
In IAM, when things break, they break loudly. An IdP outage means nobody works. A bad provisioning rule can accidentally disable the CEO's account. As a consultant, you are often the first person called during a catastrophe. Your technical skills matter, but your ability to remain calm, communicate clearly, and lead the Root Cause Analysis (RCA) is what defines your reputation. Panic is contagious; so is leadership.
Methodology & Frameworks
The Incident Command System (ICS)
Adopt a simplified ICS structure for major outages.
- Incident Commander (IC): Runs the call. Does NOT touch the keyboard.
- Scribe: Writes down everything (Time, Action, Result).
- Ops Lead: The hands-on keyboard tech fixing the issue.
- Comms Lead: Talks to the stakeholders/users.
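The Scribe's (Time, Action, Result) record can be kept in any tool; as a minimal sketch, assuming a Python helper is acceptable, the structure might look like this. The names `LogEntry`, `IncidentLog`, `record`, and `render` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    time: datetime  # when the action was taken (UTC)
    action: str     # what was done
    result: str     # what happened as a consequence


@dataclass
class IncidentLog:
    """Hypothetical append-only log kept by the Scribe during an incident."""
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, action: str, result: str,
               time: Optional[datetime] = None) -> LogEntry:
        # Default to "now" in UTC so entries from different people line up.
        entry = LogEntry(time or datetime.now(timezone.utc), action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # One line per entry: Time | Action | Result
        return "\n".join(
            f"{e.time:%H:%M:%S} | {e.action} | {e.result}" for e in self.entries
        )
```

Recording in UTC from the start saves effort later, because the same entries feed the "Timeline of Events" in the RCA.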
The "5 Whys" for RCA
Don't stop at the symptom.
- Problem: The CEO's account was disabled.
- Why? AD account was moved to "Disabled" OU.
- Why? The IAM tool triggered a disable event.
- Why? HR sent a "Terminated" status.
- Why? The HR admin clicked the wrong code.
- Why? The HR UI has "Terminate" and "Transfer" buttons right next to each other.
- Root Cause: Poor UI design in HR system (Process/UX failure), not a bug in IAM.
Key Decisions
| Decision | Options | Recommendation | Notes / Gotchas |
|---|---|---|---|
| Communication | Over-communicate vs. Silence | Over-communicate. | "We are investigating" is better than silence. Silence breeds rumors. |
| Break Glass | Use it vs. Wait | Use it (if ETA > 15 min). | If SSO is down, activate the "Break Glass" local admin accounts immediately. Don't wait 2 hours hoping it comes back. |
| Rollback | Revert Change vs. Fix Forward | Revert Change. | If a deployment caused the outage, undo it. Debug in QA, not Prod. |
| Blame | Name Names vs. Process Failure | Process Failure. | Never blame a person in the RCA. "The process allowed a user to..." |
Implementation Approach
Phase 1: Containment (Stop the Bleeding)
Activity: Isolate the issue.
- If a provisioning rule has gone rogue, disable the scheduler.
- If SSO is down, reroute traffic or enable bypass.
- Goal: Restore service, even if it's ugly.
Phase 2: Restoration
Activity: Bring systems back online carefully.
- Clear caches.
- Restart services one by one.
- Verify with a "Canary" user before opening the floodgates.
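The restoration sequence above can be sketched as a small gate: restart services in order, stop at the first unhealthy one, and only open traffic once a canary login succeeds. This is a sketch under stated assumptions, not a real tool: every callable (`restart`, `is_healthy`, `canary_login`, `open_traffic`) is a hypothetical hook you would wire to your own environment, and the per-service health check is an addition not spelled out in the steps above.

```python
from typing import Callable, Iterable


def staged_restore(
    services: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_login: Callable[[], bool],
    open_traffic: Callable[[], None],
) -> str:
    """Restart one service at a time and gate user traffic on a canary check."""
    for name in services:
        restart(name)
        # Assumed extra step: confirm each service before touching the next.
        if not is_healthy(name):
            return f"halted: {name} unhealthy"
    # Verify with a single canary user before opening the floodgates.
    if not canary_login():
        return "halted: canary login failed"
    open_traffic()
    return "open"
```

Returning a status string instead of raising keeps the outcome easy for the Scribe to log verbatim.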
Phase 3: Investigation (RCA)
Activity: Gather logs, timestamps, and screenshots.
- Construct the "Timeline of Events."
- Interview the engineers involved.
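Constructing the "Timeline of Events" usually means merging entries from several sources (HR feed, IAM tool, directory logs) into one ordered list. A minimal sketch, assuming each source has already been parsed into events with timezone-aware UTC timestamps; the `Event` type and `build_timeline` function are hypothetical names.

```python
from datetime import datetime
from typing import List, NamedTuple


class Event(NamedTuple):
    time: datetime  # timezone-aware, normalized to UTC before merging
    source: str     # where the evidence came from (log, ticket, interview)
    detail: str     # what happened


def build_timeline(*sources: List[Event]) -> List[Event]:
    """Merge events from any number of sources into chronological order."""
    merged = [event for source in sources for event in source]
    # sorted() is stable, so events with identical timestamps keep source order.
    return sorted(merged, key=lambda e: e.time)
```

Normalize timestamps first: Python refuses to compare naive and timezone-aware datetimes, and a mixed-timezone timeline is misleading even when it does sort.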
Phase 4: Remediation (Prevent Recurrence)
Activity: The "Action Items."
- "Move the button in HR."
- "Add a threshold check (Stop if >10 disables)."
- "Update the runbook."
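The threshold check in the action items above ("Stop if >10 disables") can be sketched as a guard that runs before a provisioning batch is applied. The function name, exception name, and the idea that a batch is a list of account identifiers are assumptions for illustration; a real IAM product would expose this as its own configuration or hook.

```python
from typing import List


class DisableThresholdExceeded(Exception):
    """Raised when a batch would disable more accounts than allowed."""


def guard_disables(pending_disables: List[str], limit: int = 10) -> List[str]:
    """Return the batch unchanged if it is within the limit, else stop the run.

    The default of 10 mirrors the example action item; tune it to the
    normal daily volume of the environment.
    """
    if len(pending_disables) > limit:
        raise DisableThresholdExceeded(
            f"{len(pending_disables)} disables requested, limit is {limit}; "
            "hold the batch for human review"
        )
    return pending_disables
```

Failing the whole batch, rather than applying the first ten, is deliberate: a partial run still disables the wrong people and makes the cleanup harder to reason about.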
Deliverables
- Incident Status Page: Real-time updates for users.
- Root Cause Analysis (RCA) Document: The "Post-Mortem."
- Remediation Plan: JIRA tickets to fix the root cause.
- Runbook Update: "If this happens again, do X."
Risks & Failure Modes
| Risk | Likelihood | Impact | Early Signals | Mitigation |
|---|---|---|---|---|
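| The "Hero" Complex | Med | Med | One engineer tries to fix it alone without telling anyone: "I've got this, give me 5 minutes" (for 2 hours). | IC assigns one Ops Lead and requires every action to go through the call so the Scribe can log it. |
| Log Rollover | High | High | "I can't find the error": logs were overwritten before anyone saved them. | Copy or export the relevant logs during containment, before restarting services. |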
| Exec Interference | Med | Med | CEO calling the engineer every 5 minutes. | IC intercepts the CEO. "I will update you every 30 mins. Let the team work." |
| False Fix | Med | High | Thinking it's fixed, announcing it, then it crashes again. | Verify with real users before announcing "Resolved." |
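For the Log Rollover risk in the table above, one way to protect evidence is to copy the relevant log files to a separate location as early as possible. A minimal sketch using only the Python standard library; the function name `preserve_logs` and the `incident-<timestamp>` folder layout are assumptions, and real log locations will differ per product.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, List


def preserve_logs(log_paths: Iterable[str], evidence_root: str) -> List[Path]:
    """Copy existing log files into a timestamped evidence folder."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = Path(evidence_root) / f"incident-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in map(Path, log_paths):
        if path.is_file():  # skip paths that are missing or already rotated away
            # copy2 preserves modification times, which matter for the timeline.
            # Note: two logs with the same file name would overwrite each other.
            copied.append(Path(shutil.copy2(path, dest / path.name)))
    return copied
```

The return value lists what was actually captured, so the Scribe can note which logs were missing at the time of the copy.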
KPIs / Outcomes
- MTTR (Mean Time To Recovery): How fast did we fix it?
- MTTD (Mean Time To Detection): How fast did we know it was broken?
- Recurrence Rate: Did the exact same issue happen again? (Target: 0).
- Communication Sentiment: Did users feel informed?
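MTTD and MTTR can be computed directly from the incident timeline. The sketch below assumes MTTD is measured from the start of the failure to detection and MTTR from detection to resolution; some teams measure MTTR from the start of the failure instead, so agree on the definition before reporting. The `Incident` type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Incident:
    started: datetime   # when the failure actually began
    detected: datetime  # when someone (or something) noticed
    resolved: datetime  # when service was confirmed restored


def _mean(deltas: Iterable[timedelta]) -> timedelta:
    values = list(deltas)
    if not values:
        raise ValueError("no incidents to average")
    return sum(values, timedelta()) / len(values)


def mttd(incidents: List[Incident]) -> timedelta:
    """Mean time from failure start to detection."""
    return _mean(i.detected - i.started for i in incidents)


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from detection to resolution (assumed definition)."""
    return _mean(i.resolved - i.detected for i in incidents)
```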
Consultant's Notebook (Soft Skills)
The "Fog of War"
- In the first 10 minutes of a crisis, assume roughly half of what you hear is wrong.
- "The server is down!" (Actually, the network switch is down).
- Rule: Trust but verify. "Can you show me the error screen?"
Apologize Correctly
- Bad: "We apologize for the inconvenience." (Robot).
- Good: "We know this disrupted your work and we are sorry. Here is what we are doing to make sure it doesn't happen again." (Human).
Never Waste a Good Crisis
- A crisis is the best time to get budget.
- "We had this outage because our server is 10 years old."
- Result: Approval for new HA infrastructure signed the next day.
- Use the RCA to drive necessary maturity improvements that were previously ignored.
