Senior Director – Site Reliability Engineering

Company: Visa
Location: San Mateo, CA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s, Master’s, MBA, PharmD, PhD
Experience Level: Senior, Expert or higher

Requirements

  • 12 or more years of work experience with a Bachelor’s degree, at least 10 years of work experience with an advanced degree (e.g. Master’s, MBA, JD, MD), or a minimum of 5 years of work experience with a PhD
  • Minimum of 10 years in a site reliability engineering role with at least 5 years in a leadership position managing large SRE teams
  • Proficiency in system design and architecture, particularly in a cloud environment
  • Expertise in automation and orchestration systems like Kubernetes, Terraform, and Ansible
  • Strong coding skills in languages such as Go, Python, Ruby, or Java
  • Deep understanding of networking concepts and protocols
  • Experience with continuous integration and continuous deployment (CI/CD) pipelines and tools
  • Proven track record of leading teams through complex system outages and scalability challenges
  • Ability to mentor and grow an SRE team, fostering a culture of continuous learning and innovation
  • Strong project management skills, with experience in Agile methodologies
  • Excellent verbal and written communication abilities
  • Proficiency in creating technical documentation and system diagrams
  • Experience presenting to C-level executives and stakeholders
  • Demonstrated experience in incident management and post-mortem analysis
  • Commitment to high availability, fault tolerance, and reliability in all aspects of work
  • Knowledge of compliance and security best practices in a highly regulated industry

Responsibilities

  • Lead and scale the SRE team, setting objectives and key results that align with the company’s strategic goals
  • Develop and implement SRE policies, standards, and best practices for enterprise-wide systems
  • Define standards for building reliable applications that are highly available and resilient
  • Drive the adoption of a DevSecOps culture, fostering collaboration between development and operations teams
  • Oversee the design and implementation of solutions for system monitoring, logging, alerting, and incident response
  • Collaborate with product development teams to ensure reliability and scalability are considered at the design phase
  • Manage on-call rotations, incident management processes, and post-mortem analyses to ensure continuous improvement
  • Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets for all critical services
  • Work closely with the security team to ensure compliance with industry standards and regulatory requirements
  • Lead initiatives to improve CI/CD pipelines and automate infrastructure provisioning and deployment
  • Provide technical leadership and mentorship to team members, encouraging professional growth and technical excellence

Preferred Qualifications

  • 15 or more years of experience with a Bachelor’s degree, 12 years of experience with an advanced degree (e.g. Master’s, MBA, JD, or MD), or 9+ years of experience with a PhD, in Computer Science, Engineering, or a related technical field
  • Certifications in cloud technologies (AWS, GCP, Azure)
  • Contributions to open-source projects or public speaking at relevant tech conferences
  • Strategic thinker with a vision for the future of SRE within the organization
  • Resilient and adaptable in the face of changing technology landscapes
  • Collaborative mindset with a focus on cross-functional partnerships