Staff Site Reliability Engineer – Platform
Company | Gemini |
---|---|
Location | Seattle, WA, USA; New York, NY, USA |
Salary | $168,000 – $312,000 |
Type | Full-Time |
Degrees | |
Experience Level | Senior, Expert or higher |
Requirements
- 7+ years using monitoring, alerting, and automation tooling to understand and remediate performance and health issues in systems at scale
- Good knowledge of cloud providers such as AWS, GCP, or Azure
- Experience in a code-first environment, developing automated solutions to solve support and operational issues
- Experience as a Technical Leader within a team, helping evaluate and make technical decisions for the team
- Experience working with containerization and orchestration tools such as Nomad, EKS (k8s), Docker, etc.
- Experience working with Configuration Management such as Ansible, Chef, Puppet
- Experience writing scripts or CLI tools that increase developer productivity in high-level languages like Python, Go, etc.
- Experience analyzing system and application performance, identifying bottlenecks, and recommending architectural or systemic improvements
- Experience working with Engineering teams, teaching, training, and mentoring on how to implement best-practice technical solutions
- Experience working in a code-driven, automation-first public cloud infrastructure (Terraform)
Responsibilities
- Provide primary operational support and engineering for various Gemini services
- Improve reliability, quality and time-to-market across all Gemini services and offerings
- Guide engineering teams in adopting the supported services provided by Platform
- Run ongoing performance evaluations and improvements for Gemini systems
- Provide architecture recommendations and engagement as part of the SDLC
- Create ‘Production-ready Scorecards’ to evaluate the health of systems pre-launch
- Implement and teach monitoring, alerting, and automated resolution best practices
- Define SLIs and SLOs with Engineering teams
- Educate and guide Engineering teams on reliability and resiliency best practices, such as statelessness, chaos testing, and blue/green deployments
- Build operational tooling and automations
Preferred Qualifications
No preferred qualifications provided.