Tech Lead Platform Engineer - AI & ML Ops

Company	Enable
Location	Toronto, ON, Canada
Salary	Not provided
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Senior, Expert or higher

Requirements

8+ years of experience in platform, DevOps, or cloud engineering, including at least 3 years in SaaS environments.
Bachelor’s degree in Computer Science, Engineering, or a related field.
Expertise in architecting, deploying, and managing cloud-native applications on AWS, GCP, or Azure.
Strong experience with Kubernetes, serverless computing, and container orchestration.
Proficiency in modern Infrastructure-as-Code (Terraform, Pulumi) and CI/CD tools (GitHub Actions, ArgoCD).
Experience in distributed systems, service mesh technologies (Istio, Linkerd), and event-driven architectures.
Hands-on experience with databases (SQL, NoSQL, vector DBs), ensuring high availability and performance.
Deep understanding of SaaS security principles, identity management, and compliance frameworks (SOC 2, ISO 27001).
Strong programming and automation skills in Python, SQL, and Bash.
Strong experience with microservices architecture and API development.
Familiarity with MLOps frameworks and ML model operationalization.

Responsibilities

Design, build, and optimize a cloud-native, multi-tenant SaaS platform that scales with Enable’s rapid growth.
Develop and maintain core infrastructure components, including compute, networking, observability, and CI/CD pipelines.
Implement best practices for cloud cost optimization, security, and system reliability.
Enhance API gateways, identity management, and service orchestration for seamless integration across services.
Architect and manage scalable AI/ML pipelines for model training, deployment, and monitoring.
Develop and maintain MLOps workflows using tools like Kubeflow and MLflow.
Optimize ML model inference for real-time and batch processing in a production SaaS environment.
Collaborate with data scientists and ML engineers to streamline model lifecycle management.
Implement and refine Infrastructure-as-Code (IaC) practices using Terraform or Pulumi.
Build self-healing, automated monitoring solutions for system health, application performance, and security.
Improve CI/CD processes to support high-velocity engineering teams with minimal operational overhead.
Establish robust logging, tracing, and metrics collection for visibility into SaaS application performance.
Work closely with engineering teams to ensure platform capabilities support product innovation and reliability.
Partner with security teams to meet SaaS security and compliance standards and follow established best practices.
Implement monitoring, logging, and alerting solutions to track model and system performance, ensuring compliance with best practices.
Define and document platform engineering best practices to elevate team-wide capabilities.
Mentor junior and mid-level engineers, fostering a culture of technical excellence and continuous learning.

Preferred Qualifications

Experience building and maintaining large-scale data processing pipelines.
Expertise in observability tools such as Prometheus, Grafana, OpenTelemetry, or ELK stack.
Familiarity with real-time data processing and messaging platforms (Kafka, Pub/Sub, Kinesis).
Background in high-availability, globally distributed architectures.
Experience tackling cutting-edge AI and MLOps challenges at the intersection of ML, engineering, and cloud infrastructure.
Certifications in cloud computing (e.g., AWS Certified DevOps Engineer, Azure DevOps Engineer).
Contributions to open-source DevOps, platform, or MLOps tooling.