Software Engineer in ML Systems Graduate – AML – Machine Learning Systems – 2025 Start – BS/MS

Company: ByteDance
Location: San Jose, CA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Entry Level/New Grad, Junior

Requirements

  • BS/MS degree, with knowledge of distributed and parallel computing principles and of recent advances in computing, storage, networking, and hardware technologies
  • Familiar with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX
  • Basic understanding of how GPUs and/or ASICs work
  • Expert in at least one of the following programming languages in a Linux environment: C/C++, CUDA, Python

Responsibilities

  • Research and develop our machine learning systems, including heterogeneous computing architecture, management, and monitoring
  • Deploy machine learning systems, including distributed task scheduling and machine learning training
  • Manage cross-layer optimization across systems, AI algorithms, and machine learning hardware (GPU, ASIC)
  • Implement both general-purpose training framework features and model-specific optimizations (e.g. LLMs, diffusion models)
  • Improve the efficiency and stability of extremely large-scale distributed training jobs

Preferred Qualifications

  • GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs)
  • Optimization of distributed training frameworks such as DeepSpeed, FSDP, Megatron, and GSPMD
  • AI compiler stacks such as torch.fx, XLA and MLIR
  • Experience designing and operating large-scale systems in cloud computing or machine learning
  • In-depth experience with CUDA programming and performance tuning (CUTLASS, Triton)