Software Engineer – SystemML – AI Networking
| Company | Meta |
|---|---|
| Location | Menlo Park, CA, USA |
| Salary | $85.10 – $251,000 |
| Type | Full-Time |
| Degrees | Bachelor’s, PhD |
| Experience Level | Junior, Mid Level |

Requirements
- Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- Proven C/C++ and Python programming skills
- Proven track record of leading successful projects
- Effective leadership and communication skills
- Specialized experience in one or more of the following machine learning/deep learning domains: Distributed ML Training, GPU architecture, ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine Learning frameworks (e.g. PyTorch)
Responsibilities
- Serve as tech lead for collective communication library development on Meta’s large-scale GPU training infrastructure, with a focus on GenAI/LLM scaling
Preferred Qualifications
- PhD in Computer Science, Computer Engineering, or relevant technical field
- Experience with NCCL and distributed GPU performance analysis on RoCE/InfiniBand
- Experience working with DL frameworks like PyTorch, Caffe2 or TensorFlow
- Experience with both data parallel and model parallel training, such as Distributed Data Parallel, Fully Sharded Data Parallel (FSDP), Tensor Parallel, and Pipeline Parallel
- Experience developing AI frameworks and trainers to accelerate large-scale distributed deep learning models
- Experience in HPC and parallel computing
- Knowledge of GPU architectures and CUDA programming
- Knowledge of ML, deep learning, and LLMs