Senior Engineer – Observability Lead – Cloud

Proven experience in a DevOps or observability-focused role, concentrating on production service management and operational excellence.
Prior experience working with microservices in a production environment is required.
At least 5 years of experience managing large numbers of instances in public clouds such as AWS, Azure, or GCP.
Strong expertise in observability practices and tools (e.g., Prometheus, Grafana, Datadog).
Experience enhancing logging, reducing log noise, and integrating critical metrics into services.
Proficiency in building and managing dashboards and monitoring tools.
Expertise in setting up and managing PagerDuty alerts, including knowledge of on-call rotations and escalation management.
Strong collaboration skills to work closely with engineering teams, advocating for observability best practices.
Familiarity with cloud platforms (AWS, GCP, Azure) and modern CI/CD processes.
Automation scripting or coding experience (Python, Go, or similar).
Knowledge of infrastructure-as-code tools (e.g., Terraform, CloudFormation).
Excellent problem-solving skills and attention to detail in managing complex systems.

Serve as an advocate for observability practices within the engineering team, promoting operational best practices and reliability.
Catalog all production services, documenting critical details for operational visibility and management.
Collaborate with engineering teams to develop and implement a comprehensive observability plan, ensuring metrics are integrated into all services.
Enhance logging practices where needed, reduce log noise, and ensure meaningful insights are captured.
Add and refine metrics across applications to improve operational visibility and performance tracking.
Develop detailed runbooks for critical alerts and incidents, facilitating efficient response processes.
Build and maintain dashboards that offer insights into SLAs, performance, and business metrics for engineering and product teams.
Set up and manage PagerDuty alerts, define on-call duties, and establish incident escalation paths.
Continuously improve alerting, logging, and monitoring processes to enhance service reliability and reduce unnecessary noise.
