Site Reliability Engineer

Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience)
2-3 years’ experience as an Observability Engineer or a similar role in a production environment
Deep understanding of observability principles, methodologies, and tools such as Prometheus, Grafana, Jaeger, and the ELK stack
Proficiency in a programming or scripting language such as Java, Python, or Go for automation and tooling development
Strong knowledge of cloud computing platforms (AWS preferred) and container orchestration systems (e.g., Kubernetes)
Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems
Strong communication skills and the ability to collaborate effectively with cross-functional teams

Design and develop robust observability solutions to monitor, analyze, and troubleshoot distributed systems
Familiarity with OpenTelemetry (OTel) standards and tools
Previous experience working with application teams to implement ‘self-healing’, i.e., alerting that triggers automated remediation
Implement and configure monitoring, logging, tracing, and alerting systems to ensure comprehensive coverage of our infrastructure and applications
Collaborate with software engineers to instrument code for telemetry data collection and analysis
Optimize observability tooling and processes to improve system reliability, performance, and scalability
Create dashboards, reports, and visualizations to provide actionable insights into system health and performance
Investigate and resolve incidents by analyzing telemetry data and identifying root causes
Stay current with industry trends and best practices in observability and recommend improvements to our observability strategy and infrastructure