Incident Response Manager

8+ years of experience managing major incidents in high-stakes, always-on environments.
Proven ability to lead multiple incidents simultaneously and influence diverse teams toward resolution.
Strong full-stack technical background, including cloud platforms such as AWS or Azure.
Solid infrastructure knowledge across physical, virtual, and containerized systems.
Strong analytical skills using data tools (SQL, Dynatrace, or similar).
Calm under pressure with strong task management and decision-making skills.
Excellent communicator, able to explain technical issues clearly to all audiences.

Lead Incident Response: Act as Incident Commander during major incidents, directing cross-functional teams to restore services swiftly.
Drive Technical Resolution: Leverage tools like Splunk, SQL, and cloud monitoring to identify root causes and guide remediation.
Own Communications: Deliver timely, clear updates to internal stakeholders and users during incidents.
Problem Management: Lead post-incident root cause analysis (RCA) and drive long-term fixes to prevent repeat issues.
Automation: Build and deploy scripts (Python, Bash) to enhance detection, response, and reporting.
Data-Driven Insights: Analyze incident trends and response effectiveness to inform leadership and guide improvements.
Continuous Improvement: Partner with teams to refine tools, playbooks, and processes that enhance system resilience.
On-Call Leadership: Serve as an escalation point during critical outages.

Familiarity with ITIL processes (certification a plus).
Hands-on experience with scripting/automation (Python, Ruby, JavaScript, or shell).
Experience crafting user-facing incident comms (e.g., status pages, notifications).
Understanding of distributed systems and interdependent architectures.
A track record of improving incident response operations in high-availability environments.