Site Reliability Engineer – Chaos Engineering

| Field | Value |
|---|---|
| Company | Xero |
| Location | San Mateo, CA, USA |
| Salary | $185,000 – $201,700 |
| Type | Full-Time |
| Degrees | |
| Experience Level | Mid Level |

Requirements
- Proficient in programming languages such as Python, Go, Java, C#, and C++, and in the .NET platform, for automation and tool development
- Experienced in using chaos engineering tools such as Gremlin, Chaos Monkey, or Litmus
- Excellent analytical skills to assess system performance and identify weaknesses
- Effective communication skills to collaborate with cross-functional teams and convey complex concepts
- Leadership abilities to drive chaos engineering initiatives and foster a culture of resilience
- Knowledge of cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes)
- Familiarity with monitoring and observability tools to track system health and performance metrics
Responsibilities
- Design and implement chaos experiments to identify weaknesses in system architecture and improve overall reliability
- Collaborate with cross-functional teams to develop strategies that enhance system resilience and ensure optimal performance in production environments
- Design and build a failure mode and chaos engineering environment that allows for repeatable and scalable testing
- Develop and maintain chaos engineering frameworks and tools
- Collaborate with development and operations teams to implement improvements based on experiment results
- Monitor system health and performance metrics to assess the impact of chaos experiments
- Educate team members on chaos engineering principles and best practices
- Analyze system behavior during experiments and document findings
- Continuously improve chaos engineering processes and methodologies
Preferred Qualifications
No preferred qualifications provided.