Senior SRE Engineer

Combined minimum of 6 years’ higher education and/or work experience in systems design, management and/or architecture
5+ years of experience in Site Reliability Engineering, DevOps, system design and/or architecture, or similar roles
3+ years of experience leading or managing observability initiatives
Strong hands-on experience with monitoring tools like Kibana, Dynatrace, Datadog, or similar
Solid understanding of observability concepts (metrics, logging, tracing, alerting) and frameworks (e.g., OpenTelemetry)
Experience with cloud environments such as AWS, Google Cloud, or Azure
Familiarity with containerization (Docker, Kubernetes) and orchestration platforms
Excellent problem-solving skills and ability to troubleshoot complex distributed systems
Mid-level programming skills in Python, JSON, PowerShell, or other relevant languages
Experience with incident response and post-mortem analysis
Excellent communication and collaboration skills
Advanced analytical skills
Advanced troubleshooting skills
Advanced problem-solving skills

Lead the development and implementation of observability tools and practices across multiple platforms, including monitoring, logging, tracing, and alerting
Work closely with product and engineering teams to define observability standards, goals, and best practices
Design and optimize the architecture of observability infrastructure to provide clear insights into the health, performance, and scalability of services
Troubleshoot and diagnose complex issues related to performance and availability, offering actionable insights and solutions
Mentor and guide junior SREs on observability tools and practices, fostering a culture of reliability and proactive monitoring
Manage incidents and post-incident reviews to continuously improve monitoring systems and practices
Partner with DevOps, Software Engineers, and other stakeholders to ensure seamless integration of observability tools with CI/CD pipelines
Implement and maintain high-availability monitoring and alerting systems
Ensure automation of observability tooling to scale with the growth of systems and services