Senior Software Development Engineer, MLOps

Company	Workday
Location	Toronto, ON, Canada; Beaverton, OR, USA; Atlanta, GA, USA; Vancouver, BC, Canada
Salary	$132,800 – $199,200
Type	Full-Time
Degrees	Bachelor’s, Master’s
Experience Level	Senior

Solid experience as a Software Development Engineer in the ML domain: 5+ years with a Master’s degree or higher, or 6+ years with a Bachelor’s degree, in Computer Science, Computer Engineering, or equivalent
4 years’ experience designing, implementing, and maintaining robust MLOps services for deploying, monitoring, and scaling machine learning development, primarily using Kubeflow or similar platforms
Professional experience building web applications and microservices, and designing APIs
Solid understanding of how to implement and manage CI/CD workflows to automate testing, integration, and delivery of machine learning components
Experience in supporting large Kubernetes networks in production
6 or more years of cloud programming experience, preferably in Python or Go
Experience running and maintaining ML platforms such as Databricks, SageMaker, and/or Vertex AI

Work with multi-functional teams to deliver scalable, secure and reliable solutions
Build an MLOps platform, primarily using Kubeflow and other ML ecosystem frameworks and services, to provide a unified ML development experience
Effectively engage with data scientists, ML engineers, PMs and architects in requirements elaboration and drive technical solutions
Own and develop cloud-based services from end to end including infrastructure as code
Design and build software solutions for efficient organization, storage and retrieval of data to enable substantial scale
Apply an understanding of cloud computing and security to build robust cloud infrastructure and solutions for ML teams
Build systems and dashboards to monitor service and ML health
Lead architecture reviews, code reviews, and technology evaluations
Research, evaluate, prototype and drive adoption of new ML tools with reliability and scale in mind

Experience with the implementation and operation of distributed systems
Stay abreast of industry trends and emerging technologies, providing recommendations for continuous improvement of our DevOps and machine learning practices
Troubleshoot and resolve performance bottlenecks, system outages, and other operational issues in collaboration with the ML engineering teams
Ensure the security and compliance of machine learning platforms, implementing best practices for encryption, data protection and access controls
Optimize public cloud-based infrastructure (AWS, GCP) to support the computational requirements of machine learning workloads
Experience managing tools such as Databricks and SageMaker for efficient computation and management of large-scale data lakes
Experience with data and/or ML systems, with the ability to think across layers of the stack
Develop and maintain monitoring and alerting systems for proactively identifying and addressing issues within the machine learning infrastructure
Experience in leading or mentoring other team members