Senior Site Reliability Engineer – Observability – FedRAMP IL5
Company | Splunk |
---|---|
Location | Texas, USA, Colorado, USA, North Carolina, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior, Expert or higher |
Requirements
- 7+ years of SRE experience in handling large-scale cloud-native microservices platforms.
- 3+ years of strong hands-on experience deploying, handling, and monitoring large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
- Experience with infrastructure automation and scripting using Python and/or Bash.
- Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
- Experience with deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, or Redis.
- Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
- Candidate must be a US citizen and must reside on US soil.
Responsibilities
- Design new services, tools, and monitoring to be implemented by the entire team.
- Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
- Mentor new engineers to achieve more than they thought possible.
- Work on reliability projects, including high availability (HA), business continuity planning, disaster recovery, backup/restore, RTO, RPO, chaos engineering, application uptime and performance, capacity management and planning, SLIs, SLOs, error budgets, and monitoring dashboards.
- Deploy and operate large-scale distributed data stores and streaming services.
- Establish design patterns for monitoring and benchmarking.
- Establish and document production runbooks and guidelines for developers.
- Build tooling, runbooks, and automation to reduce toil and handle production environments.
- Manage incidents and improve MTTD/MTTR for services.
- Optimize cloud costs.
Preferred Qualifications
- AWS Solutions Architect certification.
- Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certification.
- Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
- Experience with CI/CD frameworks and Pipeline-as-Code such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
- Proven ability to work effectively across teams and functions to influence the design, operations, and deployment of highly available software.
- Experience handling cloud infrastructure and operations in strict security, compliance, and regulatory environments such as FedRAMP.
- Bachelor’s/Master’s in Computer Science, Engineering, or related technical field, or equivalent practical experience.