Senior Site Reliability Engineer - Observability - FedRAMP IL5

Company	Splunk
Location	Texas, USA; Colorado, USA; North Carolina, USA
Salary	Not provided
Type	Full-Time
Degrees	Bachelor’s, Master’s
Experience Level	Senior, Expert or higher

7+ years of SRE experience operating large-scale, cloud-native microservices platforms.
3+ years of strong hands-on experience deploying, operating, and monitoring large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
Experience with infrastructure automation and scripting in Python and/or Bash.
Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
Experience with deployment, operations, and performance management of large-scale clusters of one or more of the following: Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, Redis, etc.
Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
Candidate must be a US citizen and must reside on US soil.

Design new services, tools, and monitoring to be implemented by the entire team.
Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
Mentor new engineers to achieve more than they thought possible.
Work on reliability projects, including high availability (HA), business continuity planning, disaster recovery, backup/restore, RTO, RPO, chaos engineering, application uptime and performance, capacity management and planning, SLIs, SLOs, error budgets, and monitoring dashboards.
Deploy and operate large-scale distributed data stores and streaming services.
Establish design patterns for monitoring and benchmarking.
Establish and document production runbooks and guidelines for developers.
Build tooling, runbooks, and automation to reduce toil and handle production environments.
Manage incidents and improve MTTD/MTTR for services.
Optimize cloud costs.

AWS Solutions Architect certification preferred.
Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certifications preferred.
Experience with Infrastructure-as-Code tools such as Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
Experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
Proven ability to work effectively across teams and functions to influence the design, operations, and deployment of highly available software.
Experience handling cloud infrastructure and operations in strict security, compliance, and regulatory environments such as FedRAMP.
Bachelor’s/Master’s in Computer Science, Engineering, or a related technical field, or equivalent practical experience.