Software Engineer in ML Systems Graduate – AML – Machine Learning Systems – 2025 Start – BS/MS
Company | ByteDance |
---|---|
Location | San Jose, CA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Entry Level/New Grad, Junior |
Requirements
- BS/MS degree, with knowledge of distributed and parallel computing principles and of recent advances in computing, storage, networking, and hardware technologies
- Familiarity with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX
- Basic understanding of how GPUs and/or ASICs work
- Expertise in at least one of the following programming languages in a Linux environment: C/C++, CUDA, Python
Responsibilities
- Research and develop our machine learning systems, including heterogeneous computing architecture, management, and monitoring
- Deploy machine learning systems, including distributed task scheduling and machine learning training
- Manage cross-layer optimization across systems, AI algorithms, and machine learning hardware (GPU, ASIC)
- Implement both general-purpose training framework features and model-specific optimizations (e.g., LLMs, diffusion models)
- Improve the efficiency and stability of extremely large-scale distributed training jobs
Preferred Qualifications
- GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs)
- Distributed training framework optimizations such as DeepSpeed, FSDP, Megatron, GSPMD
- AI compiler stacks such as torch.fx, XLA, and MLIR
- Experience designing and operating large-scale systems in cloud computing or machine learning
- In-depth experience with CUDA programming and performance tuning (CUTLASS, Triton)