ML Operations Engineer

Company: Benevity
Location: Toronto, ON, Canada
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Mid Level, Senior

Requirements

  • A degree in Computer Science, Engineering, or a related field.
  • 3+ years of experience in DevOps, MLOps, or SRE roles with hands-on responsibility for ML model deployment and lifecycle management.
  • Experience with cloud ML platforms such as AWS SageMaker, GCP Vertex AI, Azure ML, or Databricks.
  • Proficiency in IaC tools (Terraform, CloudFormation) and workflow orchestration or ML lifecycle tools (Airflow, Kubeflow, or MLflow).
  • Strong Python skills for scripting, automation, and interaction with ML APIs and orchestration tools.
  • Familiarity with observability tools like Prometheus, Grafana, Datadog, or cloud-native monitoring (CloudWatch, GCP Monitoring, Azure Monitor).
  • Experience implementing CI/CD pipelines for ML using GitHub Actions, Jenkins, ArgoCD, or similar.
  • Solid understanding of data security, model governance, and compliance in the context of ML systems.
  • Ability to diagnose complex issues across infrastructure, models, and data flows.
  • Excellent communication skills and a collaborative mindset to work cross-functionally in scrum teams.

Responsibilities

  • Design and manage cloud-native infrastructure for ML model training, evaluation, deployment, and monitoring on platforms like Azure ML, SageMaker, Vertex AI, or Databricks.
  • Build and maintain Infrastructure-as-Code (IaC) using tools such as Terraform to support reproducible, scalable, and auditable ML deployments.
  • Develop end-to-end MLOps pipelines supporting continuous integration and delivery (CI/CD), model versioning, automated testing, and retraining workflows.
  • Implement observability practices including logging, monitoring, and alerting to ensure model and system performance in production.
  • Optimize infrastructure for cost-efficiency, model latency, throughput, and reliability.
  • Ensure security of ML pipelines and services through authentication, authorization, rate-limiting, and auditing mechanisms.
  • Instrument ML systems with metrics, traces, logs, and dashboards to support performance monitoring and issue detection.
  • Participate in incident management, including on-call rotations, writing operational runbooks, and conducting postmortems to drive continuous improvement.
  • Apply security and compliance best practices to data handling, model outputs, and system operations, aligning with regulatory standards.
  • Work closely with data scientists to move models from experimentation to production.
  • Collaborate with software engineers to integrate ML capabilities into core products such as recommendation engines, personalization, or predictive analytics.
  • Partner with DevOps, Security, and SRE teams to maintain compliance (e.g., SOC 2, GDPR) and platform readiness.
  • Engage in architectural reviews and contribute to design decisions around machine learning infrastructure and APIs.
  • Actively participate in scrum ceremonies, including sprint planning, standups, and retrospectives.
  • Provide effort estimates, contribute to backlog grooming, and deliver quality features and improvements in a continuous delivery cycle.
  • Maintain clear documentation of ML infrastructure, processes, and decisions for transparency and collaboration.

Preferred Qualifications

  • Languages: Bash. Bonus: Go, Rust, or Java for backend systems.