Senior Software Development Engineer - MLOps
Company | Workday |
---|---|
Location | Toronto, ON, Canada; Beaverton, OR, USA; Atlanta, GA, USA; Vancouver, BC, Canada |
Salary | $132,800 – $199,200 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior |
Requirements
- Solid experience as a Software Development Engineer in the ML domain: 5+ years’ experience with a Master’s degree or higher, or 6+ years with a Bachelor’s degree, in Computer Science, Computer Engineering, or an equivalent field
- 4 years’ experience designing, implementing, and maintaining robust MLOps services for deploying, monitoring, and scaling machine learning development, primarily using Kubeflow or similar platforms
- Professional experience building web applications and microservices, and in API design
- Solid understanding of how to implement and manage CI/CD workflows to automate testing, integration, and delivery of machine learning components
- Experience in supporting large Kubernetes networks in production
- 6 or more years of cloud programming experience, preferably in Python or Go
- Experience running and maintaining ML platforms such as Databricks, SageMaker, and/or Vertex AI
Responsibilities
- Work with cross-functional teams to deliver scalable, secure, and reliable solutions
- Build an MLOps platform, primarily using Kubeflow and other ML ecosystem frameworks and services, to provide a unified ML development experience
- Engage effectively with data scientists, ML engineers, PMs, and architects on requirements elaboration, and drive technical solutions
- Own and develop cloud-based services from end to end including infrastructure as code
- Design and build software solutions for efficient organization, storage and retrieval of data to enable substantial scale
- Apply an understanding of cloud computing and security to build robust cloud infrastructure and solutions for ML teams
- Build systems and dashboards to monitor service and ML health
- Lead architecture reviews, code reviews, and technology evaluations
- Research, evaluate, prototype and drive adoption of new ML tools with reliability and scale in mind
Preferred Qualifications
- Experience implementing and operating distributed systems
- Stay abreast of industry trends and emerging technologies, providing recommendations for continuous improvement of our DevOps and machine learning practices
- Troubleshoot and resolve performance bottlenecks, system outages, and other operational issues in collaboration with the ML engineering teams
- Ensure the security and compliance of machine learning platforms, implementing best practices for encryption, data protection and access controls
- Optimize public cloud-based infrastructure (AWS, GCP) to support the computational requirements of machine learning workloads
- Experience managing tools such as Databricks and SageMaker for efficient computation and management of large-scale data lakes
- Experience with data and/or ML systems, with the ability to think across layers of the stack
- Develop and maintain monitoring and alerting systems for proactively identifying and addressing issues within the machine learning infrastructure
- Experience in leading or mentoring other team members