Senior Machine Learning Ops Engineer – ML System
Company | ByteDance |
---|---|
Location | San Jose, CA, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior |
Requirements
- Bachelor’s degree or above in computer science, computer engineering, or a related field
- Strong proficiency in at least one programming language, such as Go, Python, or Shell, in a Linux environment
- Strong hands-on experience with Kubernetes and containers, with more than 2 years of relevant operations and maintenance experience
- Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly
- Good documentation habits, with the ability to write and update workflow and technical documentation on time as required
- Strong sense of responsibility, learning ability, communication skills, self-drive, and team spirit
Responsibilities
- Responsible for ensuring our ML systems run efficiently for large model development, training, evaluation, and inference
- Responsible for the stability of offline tasks/services in multi-data center, multi-region, and multi-cloud scenarios
- Responsible for resource management and planning, including cost and budget, across computing and storage resources
- Responsible for global system disaster recovery, cluster machine governance, stability of business services, and improvements in resource utilisation and operational efficiency
- Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
- Be part of the global on-call roster that provides system and business support
Preferred Qualifications
- Experience in the operation and maintenance of large-scale distributed ML systems
- Experience in the operation and maintenance of GPU servers