Site Reliability Engineer - SRE

Bachelor's degree in Software Engineering or Software Development
8+ years of experience as an SRE, DevOps Engineer, or Systems Engineer
Strong expertise in Kubernetes (Talos Linux preferred), cloud platforms (AWS, GCP, Azure), and Linux
Hands-on experience with monitoring, logging, and incident management tools
Proficiency in Python, Bash, or Go for scripting and automation
Experience building and maintaining lab environments, including physical and virtual infrastructure
Solid knowledge of networking, distributed systems, and performance optimization
Familiarity with CI/CD workflows and Infrastructure as Code practices
Strong communication skills and ability to work cross-functionally

Design, build, and maintain resilient infrastructure across cloud and Kubernetes (Talos Linux-based) environments
Build and maintain lab infrastructure for development, testing, and validation, including networking, hardware integration, and automation
Define and monitor SLIs, SLOs, and error budgets to guide reliability efforts
Develop automation tools and scripts in Python, Bash, or Go to reduce manual toil and improve system operations
Improve observability using Prometheus, Grafana, OpenTelemetry, and other monitoring/logging solutions
Manage incident response, perform root cause analysis, and lead postmortem processes
Optimize systems for performance, scalability, and fault tolerance
Contribute to Infrastructure as Code (IaC) using Terraform, Ansible, or Helm
Collaborate with engineering teams to ensure systems are designed for operational excellence