Senior Director – Site Reliability Engineering
Company | Visa |
---|---|
Location | San Mateo, CA, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s, MBA, PharmD, PhD |
Experience Level | Senior, Expert or higher |
Requirements
- 12 or more years of work experience with a Bachelor’s degree, at least 10 years with an advanced degree (e.g. Master’s, MBA, JD, MD), or at least 5 years with a PhD
- Minimum of 10 years in a site reliability engineering role with at least 5 years in a leadership position managing large SRE teams
- Proficiency in system design and architecture, particularly in a cloud environment
- Expertise in automation and orchestration systems like Kubernetes, Terraform, and Ansible
- Strong coding skills in languages such as Go, Python, Ruby, or Java
- Deep understanding of networking concepts and protocols
- Experience with continuous integration and continuous deployment (CI/CD) pipelines and tools
- Proven track record of leading teams through complex system outages and scalability challenges
- Ability to mentor and grow an SRE team, fostering a culture of continuous learning and innovation
- Strong project management skills, with experience in Agile methodologies
- Excellent verbal and written communication abilities
- Proficiency in creating technical documentation and system diagrams
- Experience presenting to C-level executives and stakeholders
- Demonstrated experience in incident management and post-mortem analysis
- Commitment to high availability, fault tolerance, and reliability in all aspects of work
- Knowledge of compliance and security best practices in a highly regulated industry
Responsibilities
- Lead and scale the SRE team, setting objectives and key results that align with the company’s strategic goals
- Develop and implement SRE policies, standards, and best practices for enterprise-wide systems
- Define standards for building reliable applications that are highly available and resilient
- Drive the adoption of a DevSecOps culture, fostering collaboration between development and operations teams
- Oversee the design and implementation of solutions for system monitoring, logging, alerting, and incident response
- Collaborate with product development teams to ensure reliability and scalability are considered at the design phase
- Manage on-call rotations, incident management processes, and post-mortem analyses to ensure continuous improvement
- Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for all critical services
- Work closely with the security team to ensure compliance with industry standards and regulatory requirements
- Lead initiatives to improve CI/CD pipelines and automate infrastructure provisioning and deployment
- Provide technical leadership and mentorship to team members, encouraging professional growth and technical excellence
Preferred Qualifications
- 15 or more years of experience with a Bachelor’s degree, 12 years with an advanced degree (e.g. Master’s, MBA, JD, or MD), or 9 or more years with a PhD, in Computer Science, Engineering, or a related technical field
- Certifications in cloud technologies (AWS, GCP, Azure)
- Contributions to open-source projects or public speaking at relevant tech conferences
- Strategic thinker with a vision for the future of SRE within the organization
- Resilient and adaptable in the face of changing technology landscapes
- Collaborative mindset with a focus on cross-functional partnerships