Machine Learning Performance Engineer
Company | Jane Street |
---|---|
Location | New York, NY, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Not Provided |
Experience Level | Mid Level, Senior |
Requirements
- An understanding of modern ML techniques and toolsets
- The experience and systems knowledge required to debug a training run’s performance end-to-end
- Low-level GPU knowledge of PTX, SASS, warps, cooperative groups, Tensor Cores, and the memory hierarchy
- Debugging and optimization experience using tools like cuda-gdb, Nsight Systems, and Nsight Compute
- Library knowledge of Triton, CUTLASS, CUB, Thrust, cuDNN, and cuBLAS
- Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization, and asynchronous memory loads (a warp-reduction sketch illustrating the warp-level piece follows this list)
- Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimization, and NVLink, and how to use these networking technologies to link up GPU clusters
- An understanding of the collective algorithms supporting distributed GPU training in NCCL or MPI (a minimal all-reduce sketch follows this list)
- An inventive approach and the willingness to ask hard questions about whether we’re taking the right approaches and using the right tools
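
As a rough illustration of the warp-level work the requirements mention, the sketch below is a minimal warp-shuffle sum reduction in CUDA. It is not code from the role; the kernel name, launch shape, and sizes are invented for the example, and `__shfl_down_sync` stands in for the broader family of warp-level primitives.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level sum reduction: lanes exchange registers directly with
// __shfl_down_sync, so no shared memory or __syncthreads() is needed
// inside the warp.
__global__ void warp_reduce_sum(const float* in, float* out) {
    unsigned lane = threadIdx.x & 31;   // lane index within the warp
    float v = in[threadIdx.x];

    // Tree reduction across the 32 lanes of the warp.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }

    if (lane == 0) {
        out[threadIdx.x >> 5] = v;      // one partial sum per warp
    }
}

int main() {
    const int n = 64;                   // two warps; illustrative size
    float h_in[n], h_out[2];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, 2 * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    warp_reduce_sum<<<1, n>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);

    printf("warp sums: %.1f %.1f\n", h_out[0], h_out[1]);  // expect 32.0 32.0
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```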
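
Similarly, for the NCCL collectives bullet, here is a minimal single-process all-reduce sketch across whatever GPUs are visible, assuming a CUDA/NCCL toolchain. Buffer sizes and values are placeholders; a real training job would typically run one rank per process with `ncclCommInitRank` instead of `ncclCommInitAll`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

// Single-process all-reduce over all visible GPUs. Each device contributes
// a buffer of ones; after ncclAllReduce with ncclSum, every element on
// every device equals the number of GPUs.
int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) { printf("no GPUs visible\n"); return 0; }

    std::vector<ncclComm_t> comms(ndev);
    std::vector<int> devs(ndev);
    for (int i = 0; i < ndev; ++i) devs[i] = i;
    ncclCommInitAll(comms.data(), ndev, devs.data());

    const size_t count = 1 << 20;
    std::vector<float> ones(count, 1.0f);
    std::vector<float*> sendbuf(ndev), recvbuf(ndev);
    std::vector<cudaStream_t> streams(ndev);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaMemcpy(sendbuf[i], ones.data(), count * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaStreamCreate(&streams[i]);
    }

    // Group the per-device calls so NCCL launches them as one collective.
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }

    // Spot-check one element on device 0.
    float x = 0.0f;
    cudaSetDevice(0);
    cudaMemcpy(&x, recvbuf[0], sizeof(float), cudaMemcpyDeviceToHost);
    printf("recv[0] on GPU 0 = %.1f (expected %d.0)\n", x, ndev);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    return 0;
}
```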
Responsibilities
- Optimizing the performance of models—both training and inference
- Improving straightforward CUDA
- Ensuring the platform makes sense at the lowest level
- Diving deep into market data
- Tuning hyperparameters
- Debugging distributed training performance (a kernel-timing sketch follows this list)
- Studying how the model likes to trade in production
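
To ground the optimization and debugging bullets, the sketch below times a placeholder kernel with CUDA events and converts the result to an effective bandwidth, the kind of quick measurement one would sanity-check against Nsight Systems or Nsight Compute traces. The kernel, sizes, and the bandwidth model (one read plus one write per element) are assumptions for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one training-step operation;
// a real workload would be the model's forward/backward kernels.
__global__ void dummy_step(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.999f + 0.001f;
}

int main() {
    const int n = 1 << 24;
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm up once so the timed run excludes one-time costs
    // (module load, allocator behavior, etc.).
    dummy_step<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    dummy_step<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Rough effective bandwidth: one read + one write of n floats.
    double gb = 2.0 * n * sizeof(float) / 1e9;
    printf("kernel time: %.3f ms, ~%.1f GB/s\n", ms, gb / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```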
Preferred Qualifications
No preferred qualifications provided.