Director of Site Reliability Engineering

5+ years of experience leading SRE teams operating high-scale, cloud-native SaaS products.
7+ years of hands-on SRE experience in fast-paced, high-growth software companies.
Proven experience building and scaling on-call rotations, improving incident management processes, and establishing operational best practices.
Deep expertise in public cloud infrastructure, ideally Azure.
Strong understanding of Kubernetes, Infrastructure as Code (IaC), and modern observability practices (e.g., distributed tracing, metrics, and logging).
Experience implementing secure development practices, CI/CD pipelines, and operational processes in compliance-focused environments.
Demonstrated success managing cross-functional teams and collaborating with engineering, support, security, and other stakeholders.
Experience presenting to executives in high-pressure situations.
Experience managing vendor relationships and external partnerships.
Bachelor’s degree in Computer Science, Information Security, or a related field (Master’s degree preferred).

Define and drive SRE strategy: Establish and implement a vision for reliability, availability, and operational excellence across all VDC systems.
Lead incident and change management: Own and refine the processes for incident response, root cause analysis, and change control, ensuring every change is tracked and measured.
Drive organization-wide operational excellence: Act as a thought leader and change agent, using proactive failure analysis, chaos engineering, and incident reviews to continuously improve system reliability.
Enable engineering teams: Collaborate with engineering teams and develop processes and tooling that empower those teams to effectively operate their applications.
Support on-call culture: Define best practices for on-call rotations, incident response, and escalation policies. The SRE team will help set the standard for operational excellence, fill gaps in on-call coverage, and act as first responders when necessary to ensure critical issues are addressed swiftly.
Build and lead a high-performing team: Hire, mentor, and manage a global SRE team focused on automation, operational maturity, and platform reliability.
Develop and track reliability metrics: Define and monitor SLOs, SLIs, and error budgets to align reliability efforts with business needs.