Senior Full-Stack Software Engineer
Company | NVIDIA |
---|---|
Location | Seattle, WA, USA; Santa Clara, CA, USA |
Salary | $184,000 – $356,500 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior, Expert or higher |
Requirements
- 8+ years of experience in developing software infrastructure for large scale AI systems.
- Bachelor’s degree or higher in Computer Science or a related technical field (or equivalent experience).
- Proficiency in full-stack development: JavaScript (Vue or React), Node.js, Python and/or Golang, and scripting languages
- Experience with distributed systems and cloud-native technologies (Docker, Kubernetes, microservices)
- Familiarity with observability stacks: ELK, OpenSearch, Prometheus, Grafana, or Loki
- Strong debugging and root cause analysis skills across application and infrastructure layers
- Experience with large-scale AI training, inference, or data infrastructure services
- Excellent communication, collaboration, and problem-solving skills, and a growth mindset
Responsibilities
- Design, develop, and deploy full-stack web applications to support large-scale AI infrastructure operations and workflows
- Collaborate with AI and ML research teams to identify pain points and deliver tools that accelerate their work
- Develop APIs, backend services, and UIs to improve visibility, observability, and control over large-scale GPU clusters
- Develop backend services to manage job schedulers and cluster operations.
- Define and track metrics that measure efficiency, resiliency, and developer productivity across the platform
- Drive engineering excellence in testing, CI/CD, code quality, and performance
- Lead architectural discussions and mentor junior engineers on design and implementation
- Stay ahead of AI/ML infrastructure trends and drive adoption of best practices within the team
Preferred Qualifications
- Experience building developer platforms or self-service internal infrastructure tools for efficiency, resiliency, or observability.
- Hands-on experience as a Machine Learning Engineer (MLE) or deep familiarity with DL frameworks (e.g., PyTorch, TensorFlow, JAX, Ray).
- Hands-on experience operating at datacenter scale, including GPU cluster debugging and root cause analysis.
- Experience with MongoDB, Hadoop, or Spark.