Research Scientist in ML Systems

Master's degree or above, with a solid grounding in distributed and parallel computing principles and knowledge of recent advances in computing, storage, networking, and hardware technologies
Familiar with machine learning algorithms and platforms
Basic understanding of how GPUs, FPGAs, and ASICs work
Expert in at least one of the following programming languages in a Linux environment: C/C++, CUDA, Python

Research and develop our machine learning systems, including heterogeneous computing architecture, management, and monitoring
Deploy machine learning systems, including distributed task scheduling, training, and inference
Manage cross-layer optimization across systems, AI algorithms, and machine learning hardware (GPU, FPGA, ASIC)

GPU-based high-performance computing and RDMA high-performance networking (NCCL)
TensorFlow, JAX, PyTorch, or other deep learning frameworks
Large-scale data processing and parallel computing
Experience designing and operating large-scale systems in cloud computing or machine learning