Director of Site Reliability Engineering

Company: Veeam Software
Location: Seattle, WA, USA
Salary: $239,600 – $342,300
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior, Expert or higher

Requirements

  • 5+ years of experience leading SRE teams operating high-scale, cloud-native SaaS products.
  • 7+ years of hands-on SRE experience in fast-paced, high-growth software companies.
  • Proven experience building and scaling on-call rotations, improving incident management processes, and establishing operational best practices.
  • Deep expertise in public cloud infrastructure, ideally Azure.
  • Strong understanding of Kubernetes, Infrastructure as Code (IaC), and modern observability practices (e.g., distributed tracing, metrics, and logging).
  • Experience implementing secure development practices, CI/CD pipelines, and operational processes in compliance-focused environments.
  • Demonstrated success managing cross-functional teams and collaborating with engineering, support, security, and other stakeholders.
  • Experience presenting to executives in high-pressure situations.
  • Experience managing vendor relationships and external partnerships.
  • Bachelor’s degree in Computer Science, Information Security, or a related field (Master’s degree preferred).

Responsibilities

  • Define and drive SRE strategy: Establish and implement a vision for reliability, availability, and operational excellence across all VDC systems.
  • Lead incident and change management: Manage and refine processes to strengthen incident response, root cause analysis, and change control, ensuring every change is tracked and measured.
  • Drive organization-wide operational excellence: Act as a thought leader and change agent, championing proactive failure analysis, chaos engineering, and incident reviews to continuously improve system reliability.
  • Enable engineering teams: Collaborate with engineering teams and develop processes and tooling that empower those teams to effectively operate their applications.
  • Support on-call culture: Define best practices for on-call rotations, incident response, and escalation policies. The SRE team will help set the standard for operational excellence, fill gaps in on-call coverage, and act as first responders when necessary to ensure critical issues are addressed swiftly.
  • Build and lead a high-performing team: Hire, mentor, and manage a global SRE team focused on automation, operational maturity, and platform reliability.
  • Develop and track reliability metrics: Define and monitor SLOs, SLIs, and error budgets to align reliability efforts with business needs.

Preferred Qualifications

  • Master’s degree