ML Operations Engineer
Company | Benevity |
---|---|
Location | Toronto, ON, Canada |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Mid Level, Senior |
Requirements
- A degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in DevOps, MLOps, or SRE roles with hands-on responsibility for ML model deployment and lifecycle management.
- Experience with cloud ML platforms such as AWS SageMaker, GCP Vertex AI, Azure ML, or Databricks.
- Proficiency in IaC tools (Terraform, CloudFormation) and workflow orchestration or ML lifecycle tools (Airflow, Kubeflow, or MLflow).
- Strong Python skills for scripting, automation, and interaction with ML APIs and orchestration tools.
- Familiarity with observability tools like Prometheus, Grafana, Datadog, or cloud-native monitoring (CloudWatch, GCP Monitoring, Azure Monitor).
- Experience implementing CI/CD pipelines for ML using GitHub Actions, Jenkins, ArgoCD, or similar.
- Solid understanding of data security, model governance, and compliance in the context of ML systems.
- Ability to diagnose complex issues across infrastructure, models, and data flows.
- Excellent communication skills and a collaborative mindset to work cross-functionally in scrum teams.
Responsibilities
- Design and manage cloud-native infrastructure for ML model training, evaluation, deployment, and monitoring on platforms like Azure ML, SageMaker, Vertex AI, or Databricks.
- Build and maintain Infrastructure-as-Code (IaC) using tools such as Terraform to support reproducible, scalable, and auditable ML deployments.
- Develop end-to-end MLOps pipelines supporting continuous integration and delivery (CI/CD), model versioning, automated testing, and retraining workflows.
- Implement observability practices including logging, monitoring, and alerting to ensure model and system performance in production.
- Optimize infrastructure for cost-efficiency, model latency, throughput, and reliability.
- Ensure security of ML pipelines and services through authentication, authorization, rate-limiting, and auditing mechanisms.
- Instrument ML systems with metrics, traces, logs, and dashboards to support performance monitoring and issue detection.
- Participate in incident management, including on-call rotations, writing operational runbooks, and conducting postmortems to drive continuous improvement.
- Apply security and compliance best practices to data handling, model outputs, and system operations, aligning with regulatory standards.
- Work closely with data scientists to move models from experimentation to production.
- Collaborate with software engineers to integrate ML capabilities into core products such as recommendation engines, personalization, or predictive analytics.
- Partner with DevOps, Security, and SRE teams to maintain compliance (e.g., SOC 2, GDPR) and platform readiness.
- Engage in architectural reviews and contribute to design decisions around machine learning infrastructure and APIs.
- Actively participate in scrum ceremonies, including sprint planning, standups, and retrospectives.
- Provide effort estimates, contribute to backlog grooming, and deliver quality features and improvements in a continuous delivery cycle.
- Maintain clear documentation of ML infrastructure, processes, and decisions for transparency and collaboration.
Preferred Qualifications
- Languages: Bash. Bonus: Go, Rust, or Java for backend systems.