Software Engineer – SystemML – AI Networking

Company: Meta
Location: Menlo Park, CA, USA
Salary: $85.1 – $251000
Type: Full-Time
Degrees: Bachelor’s, PhD
Experience Level: Junior, Mid Level

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Proven C/C++ and Python programming skills
  • Proven track record of leading successful projects
  • Effective leadership and communication skills
  • Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)

Responsibilities

  • Tech-lead the development of the collective communication library on Meta’s large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling

Preferred Qualifications

  • PhD in Computer Science, Computer Engineering, or relevant technical field
  • Experience with NCCL and distributed GPU performance analysis on RoCE/InfiniBand
  • Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
  • Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
  • Experience in AI framework and trainer development on accelerating large-scale distributed deep learning models
  • Experience in HPC and parallel computing
  • Knowledge of GPU architectures and CUDA programming
  • Knowledge of ML, deep learning, and LLMs