Site Reliability Engineer II

Company: Yahoo
Location: United States
Salary: $96,000 – $200,000
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior

Requirements

  • Bachelor’s or Master’s degree in Computer Science or Engineering, or 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Infrastructure Engineering roles.
  • 2+ years of programming experience in Bash, Python, Java or Go.
  • In-depth knowledge of Linux distributions like Red Hat and CentOS; Linux certifications (RHCT, RHCE, LPIC) are a plus.
  • Hands-on experience with AWS core services such as EC2, S3, RDS, EKS, and Lambda, and networking services like VPC, Route 53, API Gateway, and Transit Gateway
  • Understanding of containerization and orchestration technologies, especially Kubernetes
  • Strong understanding of networking concepts (DNS, TCP/IP, HTTP/S, Load Balancing) and cloud-native networking in AWS.
  • Experience with CI/CD tools such as GitHub Actions, Jenkins, ArgoCD, Screwdriver
  • An understanding of IaC concepts, specifically using Terraform
  • Ability to troubleshoot & resolve hardware, network and software problems
  • Experience with OSS and/or commercial observability tools like Grafana, New Relic, Datadog, Splunk, Chronosphere, or AWS- or GCP-native telemetry tools
  • Strong skills in integrating diverse APIs and web services
  • Strong troubleshooting skills with a focus on automation, scalability, and resilience.
  • Excellent communication and interpersonal skills.
  • Strong desire to learn new technologies and systems as part of daily work.

Responsibilities

  • Maintain and improve comprehensive monitoring, alerting, and logging systems (e.g., OpenTSDB, Grafana, Splunk, Chronosphere, BigPanda, Rootly).
  • Enhance observability (o11y) guides and documentation to support ongoing service management operations.
  • Ensure 24/7/365 availability, scalability, and incident response for critical applications.
  • Participate in a global on-call rotation. Troubleshoot, resolve, and document production issues, escalating when necessary.
  • Monitor and report performance, availability, and SLA metrics.
  • Work with development teams to enhance, document, and improve system operability.
  • Develop, configure, and manage Terraform-based Infrastructure as Code (IaC) configurations to automate provisioning, scaling, and management of cloud environments.
  • Build CI/CD pipelines and iterate on existing Chef/Ansible templates used for application deployments, OS builds, configurations, or upgrades.
  • Modernize infrastructure by performing OS upgrades & migrating services to Kubernetes
  • Oversee change management coordination with key stakeholders
  • Develop and support automation scripts and tools for operational efficiency, leveraging AWS and GCP SDKs and APIs.
  • Provide stakeholders with progress updates on shared initiatives (email, Jira, Slack, tickets, Git, meetings)
  • Manage situations of moderate complexity and make timely decisions to ensure smooth operations
  • Develop business operations workflows for large applications to meet business needs.

Preferred Qualifications

  • Knowledge and operational experience running large-scale global distributed systems
  • Expert-level experience using Terraform for IaC
  • Strong expertise in Splunk Cloud & OpenTelemetry
  • Experience managing multi-region, multi-AZ cloud deployments with a focus on disaster recovery and fault tolerance
  • Proficient in Slack, Jira & Confluence