Senior System Software Engineer – NCCL – Partner Enablement

Company: NVIDIA
Location: Austin, TX, USA; Remote in USA; Santa Clara, CA, USA
Salary: $148,000 – $356,500
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior

Requirements

  • B.S./M.S. degree in CS/CE or equivalent experience with 5+ years of relevant experience
  • Experience with parallel programming and at least one communication runtime (MPI, NCCL, UCX, NVSHMEM)
  • Excellent C/C++ programming skills, including debugging, profiling, code optimization, performance analysis, and test design
  • Experience working with an engineering or academic research community supporting HPC or AI
  • Practical experience with high-performance networking: InfiniBand/RoCE/Ethernet networks, RDMA, topologies, congestion control
  • Expert in Linux fundamentals and a scripting language, preferably Python
  • Familiar with containers, cloud provisioning and scheduling tools (Docker, Docker Swarm, Kubernetes, SLURM, Ansible)
  • Adaptability and passion to learn new areas and tools
  • Flexibility to work and communicate effectively across different teams and timezones

Responsibilities

  • Engage with our partners and customers to root-cause functional and performance issues reported with NCCL
  • Conduct performance characterization and analysis of NCCL and DL applications on groundbreaking GPU clusters
  • Develop tools and automation to isolate issues on new systems and platforms, including cloud platforms (Azure, AWS, GCP, etc.)
  • Guide our customers and support teams on HPC knowledge and standard methodologies for running applications on multi-node clusters
  • Document and conduct trainings/webinars for NCCL
  • Engage with internal teams in different time zones on networking, GPUs, storage, infrastructure and support

Preferred Qualifications

  • Experience conducting performance benchmarking and developing infrastructure on HPC clusters
  • Prior system administration experience, especially for large clusters
  • Experience debugging network configuration issues in large scale deployments
  • Familiarity with CUDA programming and/or GPUs
  • Good understanding of Machine Learning concepts and experience with Deep Learning frameworks such as PyTorch and TensorFlow
  • Deep understanding of technology and passion for what you do