Staff Site Reliability Engineer

7+ years of experience in authoring and operating complex distributed infrastructure and applications
Strong experience with container orchestration platforms such as Kubernetes and cloud infrastructure such as AWS, GCP, or Azure
Very high proficiency with Unix/Linux, TCP/IP, DNS, load balancers, autoscaling, file systems and different types of data stores
Software development proficiency with Python, Golang, Java, or C++
Experience working across teams and implementing solutions, tools, and practices to improve observability, reliability, and scalability
Desire to work at a startup pace in a small company with a high degree of ownership
Strong motivation, gumption, and an appetite for continuous, incremental changes and completing challenging projects fast
High level of curiosity about engineering outside of your immediate discipline and an incessant desire to learn
BS+ in computer science or a related field

Design, develop, and evolve site reliability and chaos engineering for Moveworks infrastructure and services
Work closely with machine learning, search, product, infrastructure, data, and frontend teams to understand their infrastructure and operational needs and build solutions that are optimal, fault tolerant, and scalable
Author and advocate for reliability through proven distributed system design patterns (error handling, retries, rate limiting, circuit breaking, etc.)
Participate in design discussions and ensure operational readiness of infrastructure, services, and features
Design and build tools, libraries, and frameworks that allow engineering teams to rapidly deploy and scale Moveworks infrastructure and applications
Review and participate in application performance analysis / tuning and capacity planning
Set up and maintain monitoring, metrics, and reporting systems for observability and actionable alerting
Define key internal and customer-facing SLA metrics, and implement solutions and practices with different teams to improve those metrics
Own the engineering on-call process and setup
Drive discussions for outages, root cause analysis, and action items
Participate in the on-call rotation for second-tier escalation (at Moveworks, each engineer participates in the team-specific first-tier on-call rotation)
Help diagnose and resolve complex operational issues
