Research Scientist Intern – Doubao – Seed – Machine Learning System – 2025 Summer – PhD

Company: ByteDance
Location: Seattle, WA, USA
Salary: Not provided
Type: Internship
Degrees: PhD
Experience Level: Internship

Requirements

  • Currently enrolled in a PhD program, with a grounding in distributed and parallel computing principles and knowledge of recent advances in computing, storage, networking, and hardware technologies.
  • Familiar with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX.
  • Have a basic understanding of how GPUs and/or ASICs work.
  • Expert in at least one or two of the following programming languages in a Linux environment: C/C++, CUDA, Python.
  • Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment.

Responsibilities

  • Research and develop efficient machine learning systems, including efficient optimizers and parameter- and gradient-efficient training with rank reduction and communication compression.
  • Develop a state-of-the-art asynchronous training framework that ensures convergence.
  • Implement both general-purpose training framework features and model-specific optimizations (e.g., LLMs, diffusion models).
  • Improve efficiency and stability for extremely large-scale distributed training jobs.

Preferred Qualifications

  • GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs).
  • Optimization of distributed training frameworks such as DeepSpeed, FSDP, Megatron, and GSPMD.
  • AI compiler stacks such as torch.fx, XLA and MLIR.
  • Large-scale data processing and parallel computing.
  • Experience designing and operating large-scale systems in cloud computing or machine learning.
  • Experience with in-depth CUDA programming and performance tuning (CUTLASS, Triton).