Site Reliability Engineer II
| | |
|---|---|
| Company | Yahoo |
| Location | United States |
| Salary | $96,000 – $200,000 |
| Type | Full-Time |
| Degrees | Bachelor’s, Master’s |
| Experience Level | Senior |
Requirements
- Bachelor’s or Master’s degree in Computer Science or Engineering, or 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Infrastructure Engineering roles.
- 2+ years of programming experience in Bash, Python, Java or Go.
- In-depth knowledge of Linux distributions like Red Hat and CentOS; Linux certifications (RHCT, RHCE, LPIC) are a plus.
- Hands-on experience with AWS core services such as EC2, S3, RDS, EKS, and Lambda, and networking services like VPC, Route 53, API Gateway, and Transit Gateway.
- Understanding of containerization and orchestration technologies, especially Kubernetes.
- Strong understanding of networking concepts (DNS, TCP/IP, HTTP/S, Load Balancing) and cloud-native networking in AWS.
- Experience with CI/CD tools such as GitHub Actions, Jenkins, ArgoCD, and Screwdriver.
- An understanding of Infrastructure as Code (IaC) concepts, specifically using Terraform.
- Ability to troubleshoot and resolve hardware, network, and software problems.
- Experience with open-source and/or commercial observability tools like Grafana, New Relic, Datadog, Splunk, Chronosphere, or AWS- or GCP-native telemetry tools.
- Strong skills in integrating diverse APIs and web services.
- Strong troubleshooting skills with a focus on automation, scalability, and resilience.
- Excellent communication and interpersonal skills.
- Strong desire to learn new technologies and systems as part of daily work.
Responsibilities
- Maintain and improve comprehensive monitoring, alerting, and logging systems (e.g., OpenTSDB, Grafana, Splunk, Chronosphere, BigPanda, Rootly).
- Enhance observability (o11y) guides and documentation to support ongoing service management operations.
- Ensure 24/7/365 availability, scalability, and incident response for critical applications.
- Participate in a global on-call rotation. Troubleshoot, resolve, and document production issues, escalating when necessary.
- Monitor and report performance, availability, and SLA metrics.
- Work with development teams to enhance, document, and improve system operability.
- Develop, configure, and manage Terraform-based Infrastructure as Code (IaC) configurations to automate provisioning, scaling, and management of cloud environments.
- Build CI/CD pipelines and iterate on existing Chef/Ansible templates for application deployments, OS builds, configurations, and upgrades.
- Modernize infrastructure by performing OS upgrades and migrating services to Kubernetes.
- Oversee change management coordination with key stakeholders.
- Develop and support automation scripts and tools for operational efficiency, leveraging AWS and GCP SDKs and APIs.
- Provide stakeholders with progress updates on shared initiatives via email, Jira, Slack, tickets, Git, and meetings.
- Manage situations of moderate complexity and make timely decisions to ensure smooth operations.
- Develop business operations workflows for large applications to meet business needs.
Preferred Qualifications
- Knowledge and operational experience running large-scale global distributed systems
- Expert-level experience using Terraform for IaC
- Strong expertise in Splunk Cloud and OpenTelemetry
- Experience managing multi-region, multi-AZ cloud deployments with a focus on disaster recovery and fault tolerance
- Proficiency in Slack, Jira, and Confluence