Software Engineer ML Infra Systems – Senior
Company | d-Matrix |
---|---|
Location | Santa Clara, CA, USA |
Salary | $180,000 – $280,000 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s, PhD |
Experience Level | Senior |
Requirements
- BS in Computer Science, Engineering, Math, Physics, or a related field with 4+ years of industry software development experience, or (preferred) MS in one of those fields with 2+ years
- Strong grasp of system software, data structures, computer architecture, and machine learning fundamentals
- Proficient in C/C++/Python development in a Linux environment using standard development tools
- Experience with distributed, high-performance software design and implementation
- Self-motivated team player with a strong sense of ownership and leadership
Responsibilities
- Help productize the software stack for the AI compute engine
- Develop, enhance, and maintain the next-generation AI deployment software
- Build and scale software deliverables in a tight development window
- Work with a team of system software experts to build out the deployment infrastructure
- Work closely with other software (ML, compilers) and hardware experts in the company
Preferred Qualifications
- MS or PhD in Computer Science, Electrical Engineering, or related fields
- Experience with inference servers/model serving frameworks (such as TensorRT-LLM, vLLM, SGLang, etc.)
- Experience with deep learning frameworks (such as PyTorch and TensorFlow)
- Experience with deep learning runtimes (such as ONNX Runtime, TensorRT, etc.)
- Experience with collective communication libraries for distributed systems, such as NCCL and OpenMPI
- Experience with software testing fundamentals
- Experience deploying ML workloads (LLMs, VLMs, NLP, etc.) on distributed systems
- Experience with Kubernetes, Ray, or other MLOps tools and techniques used from definition to deployment
- Prior startup, small team, or incubation experience
- Work experience at a cloud provider or AI compute/subsystem company