Site Reliability Engineer
Company | Brillio |
---|---|
Location | St. Louis, MO, USA |
Salary | $55 – $60 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Junior, Mid Level |
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience)
- 2-3 years’ experience as an Observability Engineer or a similar role in a production environment
- Deep understanding of observability principles, methodologies, and tools such as Prometheus, Grafana, Jaeger, and the ELK stack
- Proficiency in programming/scripting languages like Java, Python, Go, or similar for automation and tooling development
- Strong knowledge of cloud computing platforms (AWS preferred) and container orchestration systems (e.g., Kubernetes)
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems
- Strong communication skills and the ability to collaborate effectively with cross-functional teams
Responsibilities
- Design and develop robust observability solutions to monitor, analyze, and troubleshoot distributed systems
- Familiarity with OpenTelemetry (OTel) standards and tools
- Previous experience working with application teams to implement "self-healing", i.e., alerting that triggers automated remediation
- Implement and configure monitoring, logging, tracing, and alerting systems to ensure comprehensive coverage of our infrastructure and applications
- Collaborate with software engineers to instrument code for telemetry data collection and analysis
- Optimize observability tooling and processes to improve system reliability, performance, and scalability
- Create dashboards, reports, and visualizations to provide actionable insights into system health and performance
- Investigate and resolve incidents by analyzing telemetry data and identifying root causes
- Stay current with industry trends and best practices in observability and recommend improvements to our observability strategy and infrastructure
Preferred Qualifications