Senior Site Reliability Engineer – Observability – FedRAMP IL5

Company: Splunk
Location: Texas, USA; Colorado, USA; North Carolina, USA
Salary: Not Provided
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior, Expert or higher

Requirements

  • 7+ years of SRE experience operating large-scale cloud-native microservices platforms.
  • 3+ years of strong hands-on experience deploying, operating, and monitoring large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
  • Experience with infrastructure automation and scripting in Python and/or Bash.
  • Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
  • Experience with deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, or Redis.
  • Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
  • Candidate must be a US citizen and must reside on US soil.

Responsibilities

  • Design new services, tools, and monitoring to be implemented by the entire team.
  • Analyze the tradeoffs of the proposed design and make recommendations based on these tradeoffs.
  • Mentor new engineers to achieve more than they thought possible.
  • Work on reliability projects, including high availability (HA), business continuity planning, disaster recovery, backup/restore, RTO, RPO, chaos engineering, application uptime and performance, capacity management and planning, SLIs, SLOs, error budgets, and monitoring dashboards.
  • Deploy and operate large-scale distributed data stores and streaming services.
  • Establish design patterns for monitoring and benchmarking.
  • Establish and document production runbooks and guidelines for developers.
  • Build tooling, reduce toil, and maintain runbooks and automation for production environments.
  • Manage incidents and improve MTTD/MTTR for services.
  • Optimize cloud costs.

Preferred Qualifications

  • AWS Solutions Architect certification.
  • Confluent Certified Administrator for Apache Kafka and/or Apache Cassandra Administrator Associate certification.
  • Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
  • Experience with CI/CD frameworks and Pipeline-as-Code tools such as Jenkins, Spinnaker, GitLab, Argo, Artifactory, etc.
  • Proven ability to work effectively across teams and functions to influence the design, operations, and deployment of highly available software.
  • Experience handling cloud infrastructure and operations in strict security, compliance, and regulatory environments such as FedRAMP.
  • Bachelor’s/Master’s in Computer Science, Engineering, or a related technical field, or equivalent practical experience.