Site Reliability Engineer

Company: Caddi
Location: Chicago, IL, USA
Salary: Not Provided
Type: Full-Time
Degree: Bachelor’s
Experience Level: Mid Level, Senior

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or equivalent experience
  • 4+ years in Site Reliability Engineering, DevOps, or Systems Engineering with cloud-based SaaS platforms
  • Deep Terraform and Infrastructure as Code expertise with security best practices
  • Proficiency in Python and other scripting/programming languages
  • Modern CI/CD experience (GitHub Actions, GitLab CI, Jenkins, ArgoCD, Spinnaker), including AI/ML workloads
  • Strong cloud platform experience, preferably GCP (AWS, Azure experience also valuable for future multi-cloud deployments)
  • Experience building and optimizing containers (Docker) and configuring orchestration (Kubernetes)
  • Monitoring tools experience (Datadog, Prometheus, Grafana, etc.)
  • Regulated-industry experience (Aerospace & Defense, Finance, Healthcare), including building secure platforms
  • DevSecOps principles and security integration experience
  • Security-first development mindset with understanding of secure infrastructure practices
  • Strong problem-solving and communication skills for distributed team environments

Responsibilities

  • Design, implement, and operate highly available, scalable, and fault-tolerant infrastructure, primarily on GCP but also across multi-cloud deployments
  • Lead Terraform-based infrastructure development with security best practices, encrypted state management, and governance tools
  • Build robust pipelines supporting hundreds of developers and AI engineers
  • Integrate automated security testing, vulnerability scanning, and compliance checks throughout the development lifecycle
  • Implement comprehensive observability strategies using Prometheus, Grafana, and ELK
  • Define SLOs/SLIs, manage error budgets, and lead incident response with blameless post-mortems
  • Navigate complex regulatory requirements for the U.S. Aerospace and Defense Industrial Base
  • Collaborate with security and legal teams on expanding compliance standards
  • Reduce operational toil through Python, Go, or Bash automation
  • Work in a follow-the-sun model with global teams while taking primary responsibility for US platform partition incidents and operations

Preferred Qualifications

  • Hyper-growth startup experience
  • AI Safety experience
  • MLOps and AI/ML infrastructure security experience