Software Engineer – Distributed Training Infrastructure
| Company | Clockwork Systems |
| --- | --- |
| Location | Palo Alto, CA, USA |
| Salary | Not Provided |
| Type | Full-Time |
| Degrees | |
| Experience Level | Mid Level |
Requirements
- Deep experience with PyTorch and torch.distributed (c10d); a minimal sketch of this kind of work follows this list
- Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
- Proficiency in Python and Linux shell scripting
- Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
- Strong understanding of NCCL, collective communication, and GPU topology
- Familiarity with debugging tools and techniques for distributed systems
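To illustrate the kind of torch.distributed / NCCL work these requirements describe, here is a minimal sketch, assuming a `torchrun`-style launch that sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment. It is illustrative only and does not describe Clockwork's codebase.

```python
# Minimal torch.distributed (c10d) setup with the NCCL backend, assuming a
# torchrun-style launch that provides RANK, WORLD_SIZE, and LOCAL_RANK.
import os

import torch
import torch.distributed as dist


def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # With the default env:// rendezvous, init_process_group reads RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")

    # One collective: sum a tensor across all ranks with all_reduce.
    payload = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        # On rank 0 the result equals the world size.
        print(f"all_reduce sum: {payload.item()} (world size {dist.get_world_size()})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a single node this could be launched with, for example, `torchrun --nproc_per_node=8 allreduce_demo.py`; the script name is hypothetical.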
Responsibilities
- Develop and support distributed PyTorch training jobs using torch.distributed / c10d
- Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
- Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
- Optimize performance across communication, I/O, and memory bottlenecks
- Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs (a minimal sketch follows this list)
- Write tooling and scripts to streamline training workflows and experiment management
- Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
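As a hedged sketch of the checkpointing and recovery responsibility above, the following shows rank-0 atomic checkpoint writes with resume logic. The path `/checkpoints/latest.pt` is a hypothetical placeholder; the calls used (`torch.save`, `torch.load`, `os.replace`, `dist.barrier`) are standard PyTorch and Python.

```python
# Rank-0 checkpointing with resume logic for a long-running distributed job.
# Assumes dist.init_process_group has already been called.
import os

import torch
import torch.distributed as dist

CKPT_PATH = "/checkpoints/latest.pt"  # hypothetical location


def save_checkpoint(model, optimizer, step: int) -> None:
    # Only rank 0 writes; writing to a temp file and renaming keeps a crash
    # mid-write from clobbering the last good checkpoint.
    if dist.get_rank() == 0:
        tmp_path = CKPT_PATH + ".tmp"
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            tmp_path,
        )
        os.replace(tmp_path, CKPT_PATH)
    # All ranks wait until the checkpoint is on disk before continuing.
    dist.barrier()


def maybe_resume(model, optimizer) -> int:
    # Every rank loads the same checkpoint; returns the step to resume from.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```

The temp-file-plus-rename pattern directly addresses the checkpoint-corruption failure mode listed under Responsibilities.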
Preferred Qualifications
- Experience scaling LLM training across 8+ GPUs and multiple nodes
- Knowledge of tensor, pipeline, and data parallelism
- Familiarity with containerized training environments (Docker, Singularity)
- Exposure to HPC environments or cloud GPU infrastructure
- Experience with training workload orchestration tools or custom job launchers (see the launcher sketch after this list)
- Comfort with large-scale checkpointing, resume/restart logic, and model I/O
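As an example of the launcher work mentioned above, below is a sketch of a Slurm-to-torch.distributed shim, assuming one `srun` task per GPU and that the batch script exports MASTER_ADDR and MASTER_PORT; the function name `init_from_slurm` is hypothetical.

```python
# Map Slurm environment variables to the rendezvous variables that
# torch.distributed's env:// initialization expects.
import os

import torch
import torch.distributed as dist


def init_from_slurm() -> int:
    rank = int(os.environ["SLURM_PROCID"])         # global rank across nodes
    world_size = int(os.environ["SLURM_NTASKS"])   # one task per GPU
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

    # Translate Slurm's names into the ones env:// rendezvous reads.
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank
```

Translating Slurm's variables into the names the env:// rendezvous expects keeps the training script itself launcher-agnostic, so the same code can also run under other orchestration environments.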