Head of Site Reliability Engineering

Company: Shakudo
Location: Toronto, ON, Canada
Salary: Not provided
Type: Full-Time
Degrees: Not specified
Experience Level: Senior, Expert or higher

Requirements

  • 8+ years of experience in infrastructure, DevOps, or SRE roles with increasing responsibility
  • Proven experience scaling distributed systems in high-availability production environments
  • Expertise with Kubernetes, Terraform, containerization, and at least one major cloud provider (AWS preferred)
  • Strong knowledge of system design, networking, and reliability principles
  • Experience with observability tools (e.g., Prometheus, Grafana, Datadog) and incident response practices
  • Strong leadership and communication skills, with a hands-on, collaborative approach

Responsibilities

  • Build and lead the SRE function at Shakudo, setting goals and technical direction and shaping team culture
  • Own uptime, reliability, and incident response for our platform
  • Architect scalable infrastructure using Kubernetes, cloud-native tooling, and automation frameworks
  • Lead the design of observability, monitoring, and alerting systems to proactively detect and prevent issues
  • Create and enforce best practices for CI/CD, disaster recovery, and service-level objectives (SLOs)
  • Partner closely with engineering and product to ensure new features are reliable and production-ready
  • Mentor engineers and help instill a culture of operational excellence

Preferred Qualifications

  • Experience supporting data pipelines, ML workloads, or complex orchestration systems
  • Familiarity with the data/ML tooling ecosystem (e.g., Airflow, dbt, Spark, Dremio)
  • Previous experience in a startup or high-growth environment