Head of Site Reliability Engineering

8+ years of experience in infrastructure, DevOps, or SRE roles with increasing responsibility
Proven experience scaling distributed systems in high-availability production environments
Expertise with Kubernetes, Terraform, containerization, and at least one major cloud provider (AWS preferred)
Strong knowledge of system design, networking, and reliability principles
Experience with observability tools (e.g., Prometheus, Grafana, Datadog) and incident response practices
Strong leadership and communication skills, with a hands-on, collaborative approach

Build and lead the SRE function at Shakudo, setting goals and technical direction and shaping team culture
Own uptime, reliability, and incident response for our platform
Architect scalable infrastructure using Kubernetes, cloud-native tooling, and automation frameworks
Lead the design of observability, monitoring, and alerting systems to proactively detect and prevent issues
Create and enforce best practices for CI/CD, disaster recovery, and service-level objectives (SLOs)
Partner closely with the engineering and product teams to ensure new features are reliable and production-ready
Mentor engineers and help instill a culture of operational excellence

Experience supporting data pipelines, ML workloads, or complex orchestration systems
Familiarity with the data/ML tooling ecosystem (e.g., Airflow, dbt, Spark, Dremio)
Previous experience in a startup or high-growth environment