Site Reliability Engineer – Chaos Engineering

Company: Xero
Location: San Mateo, CA, USA
Salary: $185,000 – $201,700
Type: Full-Time
Degrees: Not specified
Experience Level: Mid Level

Requirements

  • Proficient in programming languages and platforms such as Python, Go, Java, C#, C++, and .NET for automation and tool development
  • Experienced in using chaos engineering tools like Gremlin, Chaos Monkey, or Litmus
  • Excellent analytical skills to assess system performance and identify weaknesses
  • Effective communication skills to collaborate with cross-functional teams and convey complex concepts
  • Leadership abilities to drive chaos engineering initiatives and foster a culture of resilience
  • Knowledge of cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes)
  • Familiarity with monitoring and observability tools to track system health and performance metrics

Responsibilities

  • Design and implement chaos experiments to identify weaknesses in system architecture and improve overall reliability
  • Collaborate with cross-functional teams to develop strategies that enhance system resilience and ensure optimal performance in production environments
  • Design and build a failure-mode and chaos engineering environment that allows for repeatable and scalable testing
  • Develop and maintain chaos engineering frameworks and tools
  • Collaborate with development and operations teams to implement improvements based on experiment results
  • Monitor system health and performance metrics to assess the impact of chaos experiments
  • Educate team members on chaos engineering principles and best practices
  • Analyze system behavior during experiments and document findings
  • Continuously improve chaos engineering processes and methodologies

Preferred Qualifications

    No preferred qualifications provided.