Staff Site Reliability Engineer
| Company | Moveworks |
|---|---|
| Location | Mountain View, CA, USA |
| Salary | $227,000 – $290,000 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Senior, Expert or higher |
Requirements
- 7+ years of experience in authoring and operating complex distributed infrastructure and applications
- Strong experience with container orchestration platforms such as Kubernetes and cloud infrastructure such as AWS, GCP, or Azure
- Very high proficiency with Unix/Linux, TCP/IP, DNS, load balancers, autoscaling, file systems and different types of data stores
- Software development proficiency with Python, Golang, Java, or C++
- Experience working across teams and implementing solutions, tools, and practices to improve observability, reliability, and scalability
- Desire to work at a startup pace in a small company with a high degree of ownership
- Strong motivation, gumption, and an appetite for continuous, incremental changes and completing challenging projects fast
- High level of curiosity about engineering outside of your immediate discipline and an incessant desire to learn
- BS+ in computer science or a related field
Responsibilities
- Design, develop, and evolve site reliability and chaos engineering for Moveworks infrastructure and services
- Work closely with machine learning, search, product, infrastructure, data, and frontend teams to understand their infrastructure and operational needs and build solutions that are optimal, fault tolerant, and scalable
- Author and advocate for reliability through proven distributed system design patterns (error handling, retries, rate limiting, circuit breaking, etc.)
- Participate in design discussions and ensure operational readiness of infrastructure, services, and features
- Design and build tools, libraries, and frameworks that allow engineering teams to rapidly deploy and scale Moveworks infrastructure and applications
- Review and participate in application performance analysis / tuning and capacity planning
- Set up and maintain monitoring, metrics, and reporting systems for observability and actionable alerting
- Define key internal and customer-facing SLA metrics, and implement solutions and practices with different teams to improve those metrics
- Own the engineering on-call process and setup
- Drive discussions for outages, root cause analysis, and action items
- Participate in the on-call rotation for second-tier escalation (at Moveworks, each engineer participates in the team-specific first-tier on-call rotation)
- Help diagnose and resolve complex operational issues
Preferred Qualifications
No preferred qualifications provided.