Site Reliability Engineer
Company | Caddi |
---|---|
Location | Chicago, IL, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Mid Level, Senior |
Requirements
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience
- 4+ years in Site Reliability Engineering, DevOps, or Systems Engineering with cloud-based SaaS platforms
- Deep Terraform and Infrastructure as Code expertise with security best practices
- Proficiency in Python and other scripting/programming languages
- Modern CI/CD experience (GitHub Actions, GitLab CI, Jenkins, ArgoCD, Spinnaker), including AI/ML workloads
- Strong cloud platform experience, preferably GCP (AWS and Azure experience also valuable for future multi-cloud deployments)
- Experience building and optimizing containers (Docker) and configuring orchestration (Kubernetes)
- Monitoring tools experience (Datadog, Prometheus, Grafana, etc.)
- Regulated-industry experience (Aerospace & Defense, Finance, Healthcare), including building secure platforms
- DevSecOps principles and security integration experience
- Security-first development mindset with understanding of secure infrastructure practices
- Strong problem-solving and communication skills for distributed team environments
Responsibilities
- Design, implement, and operate highly available, scalable, and fault-tolerant infrastructure, primarily on GCP and extending to multi-cloud deployments
- Lead Terraform-based infrastructure development with security best practices, encrypted state management, and governance tools
- Build robust pipelines supporting hundreds of developers and AI engineers
- Integrate automated security testing, vulnerability scanning, and compliance checks throughout the development lifecycle
- Implement comprehensive observability strategies using Prometheus, Grafana, and ELK
- Define SLOs/SLIs, manage error budgets, and lead incident response with blameless post-mortems
- Navigate complex regulatory requirements for the U.S. Aerospace and Defense Industrial Base
- Collaborate with security and legal teams on expanding compliance standards
- Reduce operational toil through Python, Go, or Bash automation
- Work in a follow-the-sun model with global teams while taking primary responsibility for incidents and operations on the U.S. platform partition
Preferred Qualifications
- Hyper-growth startup experience
- AI Safety experience
- MLOps and AI/ML infrastructure security experience