Senior Manager – Site Reliability Engineering
| Company | SentinelOne |
|---|---|
| Location | United States |
| Salary | $202,400 – $278,300 |
| Type | Full-Time |
| Degrees | |
| Experience Level | Senior, Expert or higher |

Requirements
- 8+ years of engineering experience, with at least 4 years in a management role
- Demonstrated experience leading technical and operational teams at various stages of maturity
- Excellent analytical and problem-solving skills
- Familiarity with modern software development methodologies, tools, and techniques including CI/CD
- Experience working with cloud-native applications and large-scale distributed systems, including a working knowledge of technologies such as Kubernetes and Terraform/IaC, and of cloud providers such as AWS or GCP
- Experience with various monitoring and alerting techniques and tools, including frameworks and concepts such as SLOs, OpenTelemetry (OTel), and the Golden Signals, as well as tooling such as Prometheus and Grafana
- Extensive experience with incident response and management at various layers of the stack and across different business needs and applications, including both hands-on experience leading incidents and post-incident analysis and experience driving broader incident management initiatives
- Ability to thrive in a fast-paced, dynamic environment
- Driven by curiosity and humility – complex distributed systems are complex, so ask the “silly” question and seek out answers
Responsibilities
- Grow and lead a team of SRE professionals, including setting performance goals and measuring deliverables against key metrics, while evolving those metrics as SentinelOne (S1) grows and its needs develop
- Invest in data-driven deep triage on recurring issues, collaborating with other engineering teams to identify and address issues related to reliability, performance, and capacity
- Develop, improve, and implement processes for the full incident lifecycle including incident management, post-incident analysis, and learning from incidents
- Lead incident response efforts, including coordinating with other teams to investigate and resolve customer-impacting incidents
- Design a support model for SRE based on service maturity and service ownership, including monitoring and alerting improvements and SLI/SLO design and implementation
- Analyze production metrics and signals to identify areas for improvement and take proactive steps to mitigate issues
- Develop and implement best practices and standards for Site Reliability Engineering, from day-to-day operations to hiring and planning
- Communicate effectively with cross-functional teams to ensure alignment on objectives and priorities. Deliver outcomes, not just stories and tasks.
Preferred Qualifications
No preferred qualifications provided.