Senior Machine Learning Ops Engineer – ML System

Company: ByteDance
Location: San Jose, CA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Bachelor’s degree or above in computer science, computer engineering, or a related field
  • Strong proficiency in at least one programming language, such as Go, Python, or Shell, in a Linux environment
  • Strong hands-on experience with Kubernetes and containers, with more than 2 years of relevant operations and maintenance experience
  • Excellent logical analysis skills, with the ability to abstract and decompose business logic sensibly
  • Good documentation habits, with the ability to write and update workflow and technical documentation on time as required
  • Strong sense of responsibility, good learning ability, communication skills, self-drive, and team spirit

Responsibilities

  • Responsible for ensuring our ML systems run efficiently for large model development, training, evaluation, and inference
  • Responsible for the stability of offline tasks and services in multi-data-center, multi-region, and multi-cloud scenarios
  • Responsible for resource management and planning and for cost and budget, covering both computing and storage resources
  • Responsible for global system disaster recovery, cluster machine governance, stability of business services, resource utilisation improvement and operation efficiency improvement
  • Build software tools, products and systems to monitor and manage the ML infrastructure and services efficiently
  • Be part of the global team roster that provides on-call support for systems and business services

Preferred Qualifications

  • Experience operating and maintaining large-scale distributed ML systems
  • Experience in operation and maintenance of GPU servers