Software Engineer – Distributed Training Infrastructure
| Company | Clockwork Systems |
| --- | --- |
| Location | Palo Alto, CA, USA |
| Salary | Not Provided |
| Type | Full-Time |
| Degrees | |
| Experience Level | Mid Level |
Requirements
- Deep experience with PyTorch and torch.distributed (c10d); a minimal sketch of this kind of work follows this list
- Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale
- Proficiency in Python and Linux shell scripting
- Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar
- Strong understanding of NCCL, collective communication, and GPU topology
- Familiarity with debugging tools and techniques for distributed systems
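To illustrate the kind of torch.distributed / NCCL work these requirements describe, here is a minimal sketch, assuming a `torchrun`-style launch that sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment. It is illustrative only and does not describe Clockwork's codebase.

```python
# Minimal torch.distributed (c10d) setup with the NCCL backend, assuming a
# torchrun-style launch that provides RANK, WORLD_SIZE, and LOCAL_RANK.
import os

import torch
import torch.distributed as dist


def main() -> None:
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # With the default env:// rendezvous, init_process_group reads RANK,
    # WORLD_SIZE, MASTER_ADDR, and MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")

    # One collective: sum a tensor across all ranks with all_reduce.
    payload = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        # On rank 0 the result equals the world size.
        print(f"all_reduce sum: {payload.item()} (world size {dist.get_world_size()})")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

On a single node this could be launched with, for example, `torchrun --nproc_per_node=8 allreduce_demo.py`; the script name is hypothetical.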
Responsibilities
- Develop and support distributed PyTorch training jobs using torch.distributed / c10d
- Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks
- Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)
- Optimize performance across communication, I/O, and memory bottlenecks
- Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs (a minimal sketch follows this list)
- Write tooling and scripts to streamline training workflows and experiment management
- Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)
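As a hedged sketch of the checkpointing and recovery responsibility above, the following shows rank-0 atomic checkpoint writes with resume logic. The path `/checkpoints/latest.pt` is a hypothetical placeholder; the calls used (`torch.save`, `torch.load`, `os.replace`, `dist.barrier`) are standard PyTorch and Python.

```python
# Rank-0 checkpointing with resume logic for a long-running distributed job.
# Assumes dist.init_process_group has already been called.
import os

import torch
import torch.distributed as dist

CKPT_PATH = "/checkpoints/latest.pt"  # hypothetical location


def save_checkpoint(model, optimizer, step: int) -> None:
    # Only rank 0 writes; writing to a temp file and renaming keeps a crash
    # mid-write from clobbering the last good checkpoint.
    if dist.get_rank() == 0:
        tmp_path = CKPT_PATH + ".tmp"
        torch.save(
            {"model": model.state_dict(),
             "optimizer": optimizer.state_dict(),
             "step": step},
            tmp_path,
        )
        os.replace(tmp_path, CKPT_PATH)
    # All ranks wait until the checkpoint is on disk before continuing.
    dist.barrier()


def maybe_resume(model, optimizer) -> int:
    # Every rank loads the same checkpoint; returns the step to resume from.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```

The temp-file-plus-rename pattern directly addresses the checkpoint-corruption failure mode listed under Responsibilities.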
Preferred Qualifications
- Experience scaling LLM training across 8+ GPUs and multiple nodes
- Knowledge of tensor, pipeline, and data parallelism
- Familiarity with containerized training environments (Docker, Singularity)
- Exposure to HPC environments or cloud GPU infrastructure
- Experience with training workload orchestration tools or custom job launchers (see the launcher sketch after this list)
- Comfort with large-scale checkpointing, resume/restart logic, and model I/O
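As an example of the launcher work mentioned above, below is a sketch of a Slurm-to-torch.distributed shim, assuming one `srun` task per GPU and that the batch script exports MASTER_ADDR and MASTER_PORT; the function name `init_from_slurm` is hypothetical.

```python
# Map Slurm environment variables to the rendezvous variables that
# torch.distributed's env:// initialization expects.
import os

import torch
import torch.distributed as dist


def init_from_slurm() -> int:
    rank = int(os.environ["SLURM_PROCID"])         # global rank across nodes
    world_size = int(os.environ["SLURM_NTASKS"])   # one task per GPU
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

    # Translate Slurm's names into the ones env:// rendezvous reads.
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)

    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank
```

Translating Slurm's variables into the names the env:// rendezvous expects keeps the training script itself launcher-agnostic, so the same code can also run under other orchestration environments.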