Senior Manager – Site Reliability Engineering

Company: SentinelOne
Location: United States
Salary: $202,400 – $278,300
Type: Full-Time
Experience Level: Senior, Expert or higher

Requirements

  • 8+ years of engineering experience, with at least 4 years in a management role
  • Demonstrated experience leading technical and operational teams at various stages of maturity
  • Excellent analytical and problem-solving skills
  • Familiarity with modern software development methodologies, tools, and techniques including CI/CD
  • Experience working with cloud-native applications and large-scale distributed systems, including working knowledge of technologies such as Kubernetes and Terraform/IaC and of cloud providers such as AWS or GCP
  • Experience with monitoring and alerting techniques and tools, including frameworks and concepts such as SLOs, OTel, and Golden Signals, as well as tooling such as Prometheus and Grafana
  • Extensive experience with incident response and management at various layers of the stack and across different business needs and applications, including both hands-on experience leading incidents and post-incident analysis and experience driving broader incident management initiatives
  • Ability to thrive in a fast-paced, dynamic environment
  • Driven by curiosity and humility – complex distributed systems are complex, so ask the “silly” question and seek out answers

Responsibilities

  • Grow and lead a team of SRE professionals, including setting performance goals and measuring deliverables against key metrics, while evolving those metrics as S1 grows and needs develop
  • Invest in data-driven deep triage on recurring issues, collaborating with other engineering teams to identify and address issues related to reliability, performance, and capacity
  • Develop, improve, and implement processes for the full incident lifecycle including incident management, post-incident analysis, and learning from incidents
  • Lead incident response efforts, including coordinating with other teams to investigate and resolve customer-impacting incidents
  • Design a support model for SRE around service maturity and service ownership, including monitoring and alerting improvements and SLI/SLO design and implementation
  • Analyze production metrics and signals to identify areas for improvement and take proactive steps to mitigate issues
  • Develop and implement best practices and standards for Site Reliability Engineering, from day-to-day operations to hiring and planning
  • Communicate effectively with cross-functional teams to ensure alignment on objectives and priorities. Deliver outcomes, not just stories and tasks.
