Software Engineer – ML System Scheduling

Company: ByteDance
Location: Seattle, WA, USA
Salary: Not provided
Type: Full-Time
Degrees: Not specified
Experience Level: Junior, Mid Level

Requirements

  • Proficiency in one or two programming languages, such as Go, Python, or Shell, in a Linux environment
  • Familiarity with Kubernetes architecture and container technologies such as Docker, containerd, Kata, and Podman, plus extensive hands-on experience building and operating machine learning systems
  • Understanding of distributed systems principles, with experience designing, developing, and maintaining large-scale distributed systems
  • Strong logical and analytical skills, with the ability to abstract and decompose business logic sensibly
  • Strong sense of responsibility, good learning ability, communication skills, and self-drive, with the ability to respond and act quickly

Responsibilities

  • Design and develop resource scheduling for model training, model evaluation, and model inference across various scenarios (LLM, AIGC, CV, Speech, etc.)
  • Orchestrate computing resources (GPU, CPU, and other heterogeneous hardware) to make efficient use of stable, tidal, mixed, and multi-cloud resources
  • Optimize the combination of computing resources, RDMA high-speed network resources, and storage resources to make full use of large-scale distributed clusters
  • Schedule offline and online workloads across global data centers, including multi-cloud scenarios, to achieve a sensible distribution of workloads

Preferred Qualifications

  • Familiarity with at least one major machine learning framework (TensorFlow or PyTorch)
  • Experience in one of the following fields: AI Infrastructure, HW/SW Co-Design, High-Performance Computing, ML Hardware Architecture (GPU, Accelerators, Networking)