Staff Site Reliability Engineer

Company: Moveworks
Location: Mountain View, CA, USA
Salary: $227,000 – $290,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • 7+ years of experience in authoring and operating complex distributed infrastructure and applications
  • Strong experience with container orchestration platforms such as Kubernetes and cloud infrastructure such as AWS, GCP, or Azure
  • Very high proficiency with Unix/Linux, TCP/IP, DNS, load balancers, autoscaling, file systems and different types of data stores
  • Software development proficiency with Python, Golang, Java, or C++
  • Experience working across teams and implementing solutions, tools, and practices to improve observability, reliability, and scalability
  • Desire to work at a startup pace in a small company with a high degree of ownership
  • Strong motivation, gumption, and an appetite for continuous, incremental changes and completing challenging projects fast
  • High level of curiosity about engineering outside of your immediate discipline and an incessant desire to learn
  • BS+ in computer science or a related field

Responsibilities

  • Design, develop, and evolve site reliability and chaos engineering for Moveworks infrastructure and services
  • Work closely with machine learning, search, product, infrastructure, data, and frontend teams to understand their infrastructure and operational needs and build solutions that are optimal, fault tolerant, and scalable
  • Author and advocate for reliability practices built on proven distributed system design patterns (error handling, retries, rate limiting, circuit breaking, etc.)
  • Participate in design discussions and ensure operational readiness of infrastructure, services, and features
  • Design and build tools, libraries, and frameworks that allow engineering teams to rapidly deploy and scale Moveworks infrastructure and applications
  • Review and participate in application performance analysis / tuning and capacity planning
  • Set up and maintain monitoring, metrics, and reporting systems for observability and actionable alerting
  • Define key internal and customer-facing SLA metrics, and work with different teams to implement solutions and practices that improve those metrics
  • Own the engineering on-call process and setup
  • Drive discussions for outages, root cause analysis, and action items
  • Participate in the on-call rotation for second-tier escalation (at Moveworks, each engineer participates in the team-specific first-tier on-call rotation)
  • Help diagnose and resolve complex operational issues
