ML Operations Engineer

Company	Benevity
Location	Toronto, ON, Canada
Salary	Not Provided
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Mid Level, Senior

Requirements

A degree in Computer Science, Engineering, or a related field.
3+ years of experience in DevOps, MLOps, or SRE roles with hands-on responsibility for ML model deployment and lifecycle management.
Experience with cloud ML platforms such as AWS SageMaker, GCP Vertex AI, Azure ML, or Databricks.
Proficiency in IaC tools (Terraform, CloudFormation) and in workflow orchestration and ML lifecycle tooling (Airflow, Kubeflow, or MLflow).
Strong Python skills for scripting, automation, and interaction with ML APIs and orchestration tools.
Familiarity with observability tools like Prometheus, Grafana, Datadog, or cloud-native monitoring (CloudWatch, GCP Monitoring, Azure Monitor).
Experience implementing CI/CD pipelines for ML using GitHub Actions, Jenkins, ArgoCD, or similar.
Solid understanding of data security, model governance, and compliance in the context of ML systems.
Ability to diagnose complex issues across infrastructure, models, and data flows.
Excellent communication skills and a collaborative mindset to work cross-functionally in scrum teams.

Responsibilities

Design and manage cloud-native infrastructure for ML model training, evaluation, deployment, and monitoring on platforms like Azure ML, SageMaker, Vertex AI, or Databricks.
Build and maintain Infrastructure-as-Code (IaC) using tools such as Terraform to support reproducible, scalable, and auditable ML deployments.
Develop end-to-end MLOps pipelines supporting continuous integration and delivery (CI/CD), model versioning, automated testing, and retraining workflows.
Implement observability practices including logging, monitoring, and alerting to ensure model and system performance in production.
Optimize infrastructure for cost-efficiency, model latency, throughput, and reliability.
Ensure security of ML pipelines and services through authentication, authorization, rate-limiting, and auditing mechanisms.
Instrument ML systems with metrics, traces, logs, and dashboards to support performance monitoring and issue detection.
Participate in incident management, including on-call rotations, writing operational runbooks, and conducting postmortems to drive continuous improvement.
Apply security and compliance best practices to data handling, model outputs, and system operations, aligning with regulatory standards.
Work closely with data scientists to move models from experimentation to production.
Collaborate with software engineers to integrate ML capabilities into core products such as recommendation engines, personalization, or predictive analytics.
Partner with DevOps, Security, and SRE teams to maintain compliance (e.g., SOC 2, GDPR) and platform readiness.
Engage in architectural reviews and contribute to design decisions around machine learning infrastructure and APIs.
Actively participate in scrum ceremonies, including sprint planning, standups, and retrospectives.
Provide effort estimates, contribute to backlog grooming, and deliver quality features and improvements in a continuous delivery cycle.
Maintain clear documentation of ML infrastructure, processes, and decisions for transparency and collaboration.

Preferred Qualifications

Languages: Bash. Bonus: Go, Rust, or Java for backend systems.