Crisis Management & Root Cause

Overview

In IAM, when things break, they break loudly. An IdP outage means nobody works. A bad provisioning rule can accidentally terminate the CEO. As a consultant, you are often the first person called during a catastrophe. Your technical skills matter, but your ability to remain calm, communicate clearly, and lead the Root Cause Analysis (RCA) is what defines your reputation. Panic is contagious; so is leadership.

Methodology & Frameworks

The Incident Command System (ICS)

Adopt a simplified ICS structure for major outages.

Incident Commander (IC): Runs the call. Does NOT touch the keyboard.
Scribe: Writes down everything (Time, Action, Result).
Ops Lead: The hands-on-keyboard tech fixing the issue.
Comms Lead: Talks to the stakeholders/users.
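
The Scribe's (Time, Action, Result) record can be kept in anything from a shared doc to a small script. As a minimal sketch, assuming Python is at hand, the log might look like the following. The names LogEntry and IncidentLog and the sample entry are hypothetical, not part of any tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    time: datetime
    action: str
    result: str


@dataclass
class IncidentLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, action: str, result: str,
               time: Optional[datetime] = None) -> LogEntry:
        # Default to the current UTC time so entries line up with system logs.
        entry = LogEntry(time or datetime.now(timezone.utc), action, result)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # One line per entry: time | action | result.
        return "\n".join(
            f"{e.time:%H:%M:%S}Z | {e.action} | {e.result}"
            for e in self.entries
        )


# Illustrative entry with a made-up timestamp.
log = IncidentLog()
log.record(
    "Disabled provisioning scheduler",
    "No new disable events",
    time=datetime(2024, 1, 1, 9, 5, 0, tzinfo=timezone.utc),
)
print(log.render())
```

Recording in UTC is a design choice here: it avoids time zone arguments when the log is later compared against server logs during the RCA.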

The "5 Whys" for RCA

Don't stop at the symptom.

Problem: The CEO's account was disabled.

Why? AD account was moved to "Disabled" OU.
Why? The IAM tool triggered a disable event.
Why? HR sent a "Terminated" status.
Why? The HR admin selected the wrong status code.
Why? The HR UI has "Terminate" and "Transfer" buttons right next to each other.

Root Cause: Poor UI design in HR system (Process/UX failure), not a bug in IAM.

Key Decisions

Decision	Options	Recommendation	Notes / Gotchas
Communication	Over-communicate vs. Silence	Over-communicate.	"We are investigating" is better than silence. Silence breeds rumors.
Break Glass	Use it vs. Wait	Use it (if ETA > 15m).	If SSO is down, activate the "Break Glass" local admin accounts immediately. Don't wait 2 hours hoping it comes back.
Rollback	Revert Change vs. Fix Forward	Revert Change.	If a deployment caused the outage, undo it. Debug in QA, not Prod.
Blame	Name Names vs. Process Failure	Process Failure.	Never blame a person in the RCA. "The process allowed a user to..."

Implementation Approach

Phase 1: Containment (Stop the Bleeding)

Activity: Isolate the issue.

If a provisioning rule has gone rogue, disable the scheduler.
If SSO is down, reroute traffic or enable bypass.
Goal: Restore service, even if it's ugly.

Phase 2: Restoration

Activity: Bring systems back online carefully.

Clear caches.
Restart services one by one.
Verify with a "Canary" user before opening the floodgates.
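
The restoration steps above can be sketched as a small gate: restart services one at a time, then confirm a canary login before declaring the system ready. This is only an illustration under stated assumptions. The callables restart, is_healthy, and canary_login_ok are hypothetical hooks you would wire to your own tooling, the per-service health check is an added assumption, and the service names are made up.

```python
from typing import Callable, Iterable


def staged_restore(
    services: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_login_ok: Callable[[], bool],
) -> str:
    """Restart services in order, then gate the reopening on a canary login."""
    for name in services:
        restart(name)
        # Stop at the first unhealthy service instead of restarting the rest.
        if not is_healthy(name):
            return f"halted: {name} unhealthy after restart"
    # Only one test user logs in before everyone else is let back in.
    if not canary_login_ok():
        return "halted: canary login failed"
    return "ready to open to users"


# Illustrative run with stub hooks standing in for real tooling.
restarted = []
outcome = staged_restore(
    ["directory-sync", "sso-gateway"],
    restart=restarted.append,
    is_healthy=lambda name: True,
    canary_login_ok=lambda: True,
)
print(outcome)
```

Returning a status string rather than raising keeps the result easy for the Scribe to log verbatim.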

Phase 3: Investigation (RCA)

Activity: Gather logs, timestamps, and screenshots.

Construct the "Timeline of Events."
Interview the engineers involved.
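
Building the "Timeline of Events" usually means merging entries from several systems into one chronological list. A minimal sketch, assuming each source has already been parsed into (time, source, message) records, might look like this. Event and build_timeline are hypothetical names, and the timestamps are invented to mirror the 5 Whys example above.

```python
from datetime import datetime
from itertools import chain
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    time: datetime
    source: str
    message: str


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    # Sort across all sources by timestamp; sorted() is stable, so events
    # with identical times keep the order in which the sources were passed.
    return sorted(chain(*sources), key=lambda e: e.time)


# Made-up events from three systems, deliberately passed out of order.
ad_events = [Event(datetime(2024, 1, 1, 9, 2), "AD", "Account moved to Disabled OU")]
hr_events = [Event(datetime(2024, 1, 1, 9, 0), "HR", "Status set to Terminated")]
iam_events = [Event(datetime(2024, 1, 1, 9, 1), "IAM", "Disable event triggered")]

timeline = build_timeline(ad_events, hr_events, iam_events)
for event in timeline:
    print(f"{event.time:%H:%M} {event.source}: {event.message}")
```

This assumes all timestamps are in the same time zone; mixed zones need to be normalized first, or the timeline will mislead the RCA.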

Phase 4: Remediation (Prevent Recurrence)

Activity: The "Action Items."

"Move the button in HR."
"Add a threshold check (Stop if >10 disables)."
"Update the runbook."

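The threshold check in the action items above ("Stop if >10 disables") could be sketched as a guard that runs before a provisioning batch is applied. This is an illustration only: check_disable_threshold and DisableThresholdExceeded are hypothetical names, not part of any IAM product, and the limit of 10 is taken from the example action item.

```python
from typing import List


class DisableThresholdExceeded(Exception):
    """Raised when a batch would disable more accounts than allowed."""


def check_disable_threshold(pending_disables: List[str],
                            max_disables: int = 10) -> List[str]:
    # Refuse the whole batch rather than applying part of it, so a human
    # reviews the run before any account is touched.
    if len(pending_disables) > max_disables:
        raise DisableThresholdExceeded(
            f"{len(pending_disables)} disables pending, limit is {max_disables}"
        )
    return pending_disables


# A small batch passes through unchanged.
approved = check_disable_threshold(["user1", "user2", "user3"])
print(f"{len(approved)} disables approved")
```

Failing the whole batch is a deliberate choice in this sketch: a run that wants to disable far more accounts than usual is more likely a bad feed than a real wave of terminations.
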
Deliverables

Incident Status Page: Real-time updates for users.
Root Cause Analysis (RCA) Document: The "Post-Mortem."
Remediation Plan: JIRA tickets to fix the root cause.
Runbook Update: "If this happens again, do X."

Risks & Failure Modes

Risk	Likelihood	Impact	Early Signals	Mitigation
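The "Hero" Complex	Med	Med	One engineer tries to fix it alone without telling anyone: "I've got this, give me 5 minutes" (for 2 hours).	The IC runs the call; all fixes go through the Ops Lead and are reported to the Scribe.
Log Rollover	High	High	Logs are overwritten before they can be saved: "I can't find the error."	Save logs and screenshots as soon as containment starts, before they roll over.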
Exec Interference	Med	Med	CEO calling the engineer every 5 minutes.	IC intercepts the CEO. "I will update you every 30 mins. Let the team work."
False Fix	Med	High	Thinking it's fixed, announcing it, then it crashes again.	Verify with real users before announcing "Resolved."

KPIs / Outcomes

MTTR (Mean Time To Recovery): How fast did we fix it?
MTTD (Mean Time To Detection): How fast did we know it was broken?
Recurrence Rate: Did the exact same issue happen again? (Target: 0).
Communication Sentiment: Did users feel informed?
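
The first three KPIs can be computed from incident records. The sketch below is one way to do it, under assumptions that should be agreed with the client: detection and recovery are both measured from the start of impact, and a recurrence is any incident whose root cause has been seen before. The Incident record, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    started: datetime   # when impact began
    detected: datetime  # when the team learned about it
    resolved: datetime  # when service was restored
    root_cause: str


def mttd(incidents: List[Incident]) -> timedelta:
    # Mean time from start of impact to detection.
    return timedelta(seconds=mean(
        [(i.detected - i.started).total_seconds() for i in incidents]))


def mttr(incidents: List[Incident]) -> timedelta:
    # Mean time from start of impact to recovery (assumed definition).
    return timedelta(seconds=mean(
        [(i.resolved - i.started).total_seconds() for i in incidents]))


def recurrence_rate(incidents: List[Incident]) -> float:
    # Share of incidents whose root cause already appeared in an earlier one.
    seen = set()
    repeats = 0
    for inc in sorted(incidents, key=lambda i: i.started):
        if inc.root_cause in seen:
            repeats += 1
        seen.add(inc.root_cause)
    return repeats / len(incidents)


# Made-up sample data for illustration.
history = [
    Incident(datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10),
             datetime(2024, 1, 1, 10, 0), "HR status code error"),
    Incident(datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 20),
             datetime(2024, 2, 1, 14, 40), "Expired certificate"),
]
print(mttd(history), mttr(history), recurrence_rate(history))
```

Some teams measure recovery from detection rather than from start of impact; whichever definition is chosen, keep it fixed so the trend stays comparable across quarters.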

Consultant's Notebook (Soft Skills)

The "Fog of War"

In the first 10 minutes of a crisis, assume that about half of what you hear is wrong.
"The server is down!" (Actually, the network switch is down).
Rule: Trust but verify. "Can you show me the error screen?"

Apologize Correctly

Bad: "We apologize for the inconvenience." (Robot).
Good: "We know this disrupted your work and we are sorry. Here is what we are doing to make sure it doesn't happen again." (Human).

Never Waste a Good Crisis

A crisis is the best time to get budget.
"We had this outage because our server is 10 years old."
Result: Approval for new HA infrastructure signed the next day.
Use the RCA to drive necessary maturity improvements that were previously ignored.