Senior Machine Learning Ops Engineer - ML System

Bachelor’s degree or above in computer science, computer engineering, or a related field
Strong proficiency in at least one programming language, such as Go, Python, or Shell, in a Linux environment
Strong hands-on experience with Kubernetes and containers, with more than 2 years of relevant operations and maintenance experience
Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly
Good documentation habits, with the ability to write and update workflow and technical documentation as required and on time
Strong sense of responsibility, learning ability, communication skills, self-drive, and team spirit

Responsible for ensuring our ML systems run efficiently for large model development, training, evaluation, and inference
Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios
Responsible for resource management, planning, cost, and budget, covering both computing and storage resources
Responsible for global system disaster recovery, cluster machine governance, and the stability of business services, as well as improving resource utilisation and operational efficiency
Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
Be part of the global team roster that provides on-call support for systems and business services