Incident Response Manager
Company | Toyota |
---|---|
Location | Plano, TX, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior, Expert or higher |
Requirements
- 8+ years of experience managing major incidents in high-stakes, always-on environments.
- Proven ability to lead multiple incidents simultaneously and influence diverse teams toward resolution.
- Strong full-stack technical background, including cloud platforms like AWS or Azure.
- Solid infrastructure knowledge across physical, virtual, and containerized systems.
- Strong analytical skills with data tools (SQL, Dynatrace, or similar).
- Calm under pressure with strong task management and decision-making skills.
- Excellent communicator, able to explain technical issues clearly to all audiences.
Responsibilities
- Lead Incident Response: Act as Incident Commander during major incidents, directing cross-functional teams to restore services swiftly.
- Drive Technical Resolution: Leverage tools like Splunk, SQL, and cloud monitoring to identify root causes and guide remediation.
- Own Communications: Deliver timely, clear updates to internal stakeholders and users during incidents.
- Problem Management: Lead post-incident root cause analysis (RCA) and drive long-term fixes to prevent repeat issues.
- Automation: Build and deploy scripts (Python, Bash) to enhance detection, response, and reporting.
- Data-Driven Insights: Analyze incident trends and response effectiveness to inform leadership and guide improvements.
- Continuous Improvement: Partner with teams to refine tools, playbooks, and processes that enhance system resilience.
- On-Call Leadership: Serve as an escalation point during critical outages.
Preferred Qualifications
- Familiarity with ITIL processes (certification a plus).
- Hands-on experience with scripting/automation (Python, Ruby, JavaScript, or shell).
- Experience crafting user-facing incident comms (e.g., status pages, notifications).
- Understanding of distributed systems and interdependent architecture.
- A track record of improving incident response operations in high-availability environments.