Senior Site Reliability Engineer - Observability - FedRAMP

Extensive experience as a Linux system administrator supporting enterprise computing platforms and systems.
Expertise in public cloud (AWS, GCP, Azure) and container orchestration tools (Kubernetes, Docker).
Knowledge and understanding of OpenTelemetry.
Deep understanding of logging, monitoring, tracing, and alerting practices in large-scale distributed systems.
Proficiency with programming languages such as Python, along with shell scripting, to automate tasks.
Experience supporting customer-facing SaaS infrastructure or similar cloud-related services.
Experience in administering or architecting distributed Splunk and Observability environments.
Experience setting up SLOs and SLIs.

Support and build Splunk’s large-scale Cloud offering.
Work with a diverse, geographically distributed team to deliver an excellent product and extraordinary customer experience.
Build and run distributed systems at scale in production, understanding the challenges and trade-offs involved.
Automate processes where possible.
Apply knowledge of best practices related to security, performance, and disaster recovery.
Identify performance bottlenecks, spot anomalous system behavior, and determine the root cause of incidents.
Monitor cloud environments using tools like Splunk, VictorOps, and SignalFx.
Maintain clear, current documentation so the team can operate effectively.
Tackle complex problems, resolve operational issues, and interact with vendors for solutions.
Handle critical, customer-facing issues and prioritize quickly during escalations.

No preferred qualifications provided.