Director of Site Reliability Engineering
Company | Veeam Software |
---|---|
Location | Seattle, WA, USA |
Salary | $239,600 – $342,300 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior, Expert or higher |

Requirements
- 5+ years of experience leading SRE teams operating high-scale, cloud-native SaaS products.
- 7+ years of hands-on SRE experience in fast-paced, high-growth software companies.
- Proven experience building and scaling on-call rotations, improving incident management processes, and establishing operational best practices.
- Deep expertise in public cloud infrastructure, ideally Azure.
- Strong understanding of Kubernetes, Infrastructure as Code (IaC), and modern observability practices (e.g., distributed tracing, metrics, and logging).
- Experience implementing secure development practices, CI/CD pipelines, and operational processes in compliance-focused environments.
- Demonstrated success managing cross-functional teams and collaborating with engineering, support, security, and other stakeholders.
- Experience presenting to executives in high-pressure situations.
- Experience managing vendor relationships and external partnerships.
- Bachelor’s degree in Computer Science, Information Security, or a related field (Master’s degree preferred).
Responsibilities
- Define and drive SRE strategy: Establish and implement a vision for reliability, availability, and operational excellence across all VDC systems.
- Lead incident and change management: Manage and refine the processes for incident response, root cause analysis, and change control, ensuring every change is tracked and measured.
- Drive organization-wide operational excellence: Act as a thought leader and change agent to drive proactive failure analysis, chaos engineering, and incident reviews to continuously improve system reliability.
- Enable engineering teams: Collaborate with engineering teams and develop processes and tooling that empower those teams to effectively operate their applications.
- Support on-call culture: Define best practices for on-call rotations, incident response, and escalation policies. The SRE team will help set the standard for operational excellence, fill gaps in on-call coverage, and act as first responders when necessary to ensure critical issues are addressed swiftly.
- Build and lead a high-performing team: Hire, mentor, and manage a global SRE team focused on automation, operational maturity, and platform reliability.
- Develop and track reliability metrics: Define and monitor SLOs, SLIs, and error budgets to align reliability efforts with business needs.
Preferred Qualifications
- Master’s degree