Site Reliability Engineer II

Bachelor’s or Master’s degree in Computer Science or Engineering, or 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or Infrastructure Engineering roles.
2+ years of programming experience in Bash, Python, Java or Go.
In-depth knowledge of Linux distributions like Red Hat and CentOS; Linux certifications (RHCT, RHCE, LPIC) are a plus.
Hands-on experience with AWS core services such as EC2, S3, RDS, EKS, and Lambda, and networking services like VPC, Route 53, API Gateway, and Transit Gateway.
Understanding of containerization and orchestration technologies, especially Kubernetes.
Strong understanding of networking concepts (DNS, TCP/IP, HTTP/S, Load Balancing) and cloud-native networking in AWS.
Experience with CI/CD tools such as GitHub Actions, Jenkins, ArgoCD, and Screwdriver.
Understanding of Infrastructure as Code (IaC) concepts, specifically using Terraform.
Ability to troubleshoot and resolve hardware, network, and software problems.
Experience with open-source and/or commercial observability tools like Grafana, New Relic, Datadog, Splunk, Chronosphere, or AWS- or GCP-native telemetry tools.
Strong skills in integrating diverse APIs and web services.
Strong troubleshooting skills with a focus on automation, scalability, and resilience.
Excellent communication and interpersonal skills.
Strong desire to learn new technologies and systems as part of daily work.

Maintain and improve comprehensive monitoring, alerting, and logging systems (e.g., OpenTSDB, Grafana, Splunk, Chronosphere, BigPanda, Rootly).
Enhance observability (o11y) guides and documentation to support ongoing service management operations.
Ensure 24/7/365 availability, scalability, and incident response for critical applications.
Participate in a global on-call rotation. Troubleshoot, resolve, and document production issues, escalating when necessary.
Monitor and report performance, availability, and SLA metrics.
Work with development teams to enhance, document, and improve system operability.
Develop, configure, and manage Terraform-based Infrastructure as Code (IaC) configurations to automate provisioning, scaling, and management of cloud environments.
Build CI/CD pipelines and iterate on existing Chef/Ansible templates for application deployments used for OS builds, configurations, or upgrades.
Modernize infrastructure by performing OS upgrades and migrating services to Kubernetes.
Oversee change management coordination with key stakeholders.
Develop and support automation scripts and tools for operational efficiency, leveraging AWS and GCP SDKs and APIs.
Provide stakeholders with progress updates on shared initiatives (email, Jira, Slack, tickets, Git, meetings).
Manage situations of moderate complexity and make timely decisions to ensure smooth operations.
Develop business operations workflows for large applications to meet business needs.

Knowledge and operational experience running large-scale global distributed systems
Expertise in using Terraform for IaC
Strong expertise in Splunk Cloud and OpenTelemetry
Experience managing multi-region, multi-AZ cloud deployments with a focus on disaster recovery and fault tolerance
Proficient in Slack, Jira & Confluence