Senior System Software Engineer – NCCL – Partner Enablement
Company | NVIDIA |
---|---|
Location | Austin, TX, USA, Remote in USA, Santa Clara, CA, USA |
Salary | $148,000 – $356,500 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior |
Requirements
- B.S./M.S. degree in CS/CE, or equivalent experience, with 5+ years of relevant experience
- Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
- Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
- Experience working with the engineering or academic research community supporting HPC or AI
- Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
- Expert in Linux fundamentals and a scripting language, preferably Python
- Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
- Adaptability and passion to learn new areas and tools
- Flexibility to work and communicate effectively across different teams and timezones
Responsibilities
- Engage with our partners and customers to root-cause functional and performance issues reported with NCCL
- Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
- Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
- Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
- Document and conduct trainings/webinars for NCCL
- Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support
Preferred Qualifications
- Experience conducting performance benchmarking and developing infrastructure on HPC clusters
- Prior system administration experience, especially for large clusters
- Experience debugging network configuration issues in large scale deployments
- Familiarity with CUDA programming and/or GPUs
- Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow
- Deep understanding of technology and passion for what you do