Senior Engineer – Observability Lead – Cloud

Company: Illumio
Location: Sunnyvale, CA, USA
Salary: $164,000 – $189,000
Type: Full-Time
Experience Level: Senior

Requirements

  • Proven experience in a DevOps or observability-focused role, concentrating on production service management and operational excellence.
  • Prior experience working with microservices in a production environment is a must.
  • At least 5 years of experience managing large numbers of instances in public clouds such as AWS, Azure, or GCP.
  • Strong expertise in observability practices and tools (e.g., Prometheus, Grafana, Datadog).
  • Experience enhancing logging, reducing log noise, and integrating critical metrics into services.
  • Proficiency in building and managing dashboards and monitoring tools.
  • Expertise in setting up and managing PagerDuty alerts, including knowledge of on-call rotations and escalation management.
  • Strong collaboration skills to work closely with engineering teams, advocating for observability best practices.
  • Familiarity with cloud platforms (AWS, GCP, Azure) and modern CI/CD processes.
  • Automation scripting or coding experience (Python, Go, or similar).
  • Knowledge of infrastructure-as-code tools (e.g., Terraform, CloudFormation).
  • Excellent problem-solving skills and attention to detail in managing complex systems.

Responsibilities

  • Serve as an advocate for observability practices within the engineering team, promoting operational best practices and reliability.
  • Catalog all production services, documenting critical details for operational visibility and management.
  • Collaborate with engineering teams to develop and implement a comprehensive observability plan, ensuring metrics are integrated into all services.
  • Enhance logging practices where needed, reduce log noise, and ensure meaningful insights are captured.
  • Add and refine metrics across applications to improve operational visibility and performance tracking.
  • Develop detailed runbooks for critical alerts and incidents, facilitating efficient response processes.
  • Build and maintain dashboards that offer insights into SLAs, performance, and business metrics for engineering and product teams.
  • Set up and manage PagerDuty alerts, define on-call duties, and establish incident escalation paths.
  • Continuously improve alerting, logging, and monitoring processes to enhance service reliability and reduce unnecessary noise.
