Senior Engineer – Observability Lead – Cloud
| Company | Illumio |
|---|---|
| Location | Sunnyvale, CA, USA |
| Salary | $164,000 – $189,000 |
| Type | Full-Time |
| Degrees | |
| Experience Level | Senior |

Requirements
- Proven experience in a DevOps or observability-focused role, concentrating on production service management and operational excellence.
- Prior experience working with microservices in a production environment is a must.
- At least 5 years of experience managing large numbers of instances in public clouds such as AWS, Azure, or GCP.
- Strong expertise in observability practices and tools (e.g., Prometheus, Grafana, Datadog).
- Experience enhancing logging, reducing log noise, and integrating critical metrics into services.
- Proficiency in building and managing dashboards and monitoring tools.
- Expertise in setting up and managing PagerDuty alerts, along with knowledge of on-call rotations and escalation management.
- Strong collaboration skills to work closely with engineering teams, advocating for observability best practices.
- Familiarity with cloud platforms (AWS, GCP, Azure) and modern CI/CD processes.
- Automation scripting or coding experience (Python, Go, or similar).
- Knowledge of infrastructure-as-code tools (e.g., Terraform, CloudFormation).
- Excellent problem-solving skills and attention to detail in managing complex systems.
Responsibilities
- Serve as an advocate for observability practices within the engineering team, promoting operational best practices and reliability.
- Catalog all production services, documenting critical details for operational visibility and management.
- Collaborate with engineering teams to develop and implement a comprehensive observability plan, ensuring metrics are integrated into all services.
- Enhance logging practices where needed, reduce log noise, and ensure meaningful insights are captured.
- Add and refine metrics across applications to improve operational visibility and performance tracking.
- Develop detailed runbooks for critical alerts and incidents, facilitating efficient response processes.
- Build and maintain dashboards that offer insights into SLAs, performance, and business metrics for engineering and product teams.
- Set up and manage PagerDuty alerts, define on-call duties, and establish incident escalation paths.
- Continuously improve alerting, logging, and monitoring processes to enhance service reliability and reduce unnecessary noise.