Software Engineer – ML System Scheduling
Company | ByteDance |
---|---|
Location | Seattle, WA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Junior, Mid Level |
Requirements
- Be proficient in one or two programming languages, such as Go, Python, or Shell, in a Linux environment
- Be familiar with Kubernetes architecture and container technologies such as Docker, containerd, Kata, and Podman, and have extensive hands-on experience in Machine Learning system development
- Understand the principles of distributed systems and have experience in the design, development and maintenance of large-scale distributed systems
- Have excellent logical analysis skills and be able to abstract and decompose business logic sensibly
- Have a strong sense of responsibility, the ability to learn quickly, good communication skills, and self-motivation, and be able to respond and act quickly
Responsibilities
- Responsible for the design and development of resource scheduling for model training, model evaluation, and model inference across various scenarios (LLM/AIGC/CV/Speech, etc.)
- Responsible for the optimal orchestration of various computing resources (GPU, CPU, and other heterogeneous hardware), making efficient use of stable resources, tidal resources, mixed resources, and multi-cloud resources
- Responsible for the optimal combination of computing resources, RDMA high-speed network resources, and storage resources to make full use of the power of large-scale distributed clusters
- Responsible for scheduling offline and online workloads across global data centers, including multi-cloud scenarios, to achieve a well-balanced distribution of workloads
Preferred Qualifications
- Familiar with at least one major Machine Learning framework (TensorFlow/PyTorch)
- Experience in one of the following fields: AI Infrastructure, HW/SW Co-Design, High-Performance Computing, ML Hardware Architecture (GPU, Accelerators, Networking)