Software Engineer – Fleet Health Instrumentation Intern

Company: NVIDIA
Location: Santa Clara, CA, USA
Salary: $18 – $71
Type: Internship
Degrees: Bachelor’s, Master’s
Experience Level: Internship

Requirements

  • Actively pursuing a BS or MS in Computer Science, Computer Engineering, or a closely related quantitative field (e.g., Physics or Mathematics).
  • Solid understanding of distributed‑systems fundamentals, modern software‑engineering practices, and data‑modeling principles.
  • Proficiency in at least one programming language, preferably Python or Go.
  • Working knowledge of Linux, basic networking concepts, and Kubernetes container orchestration.

Responsibilities

  • Design and build software that collects, transforms, and publishes health data about our global GPU fleet.
  • Develop microservices and data pipelines in Go or Python that ingest and normalize data from many diverse sources, routing millions of records per day (Kafka, Airflow, Kinesis).
  • Instrument production infrastructure and workloads running on Kubernetes and bare-metal clusters; add tracing and metrics hooks for deeper insights.
  • Automate deployments and testing with CI/CD (GitLab, Argo) and IaC (Terraform), ensuring repeatable, low-touch releases.
  • Participate in the full lifecycle of cloud services—from design docs and code reviews through deployment, monitoring, and continuous improvement.
  • Collaborate with other engineers to debug live issues and turn post-incident insights into durable code fixes.
  • Contribute to internal tooling and dashboards that help engineers visualize fleet health, utilization, and capacity trends.

Preferred Qualifications

  • A systematic, analytical problem‑solving approach paired with clear written and verbal communication skills and a strong sense of ownership.
  • Demonstrated ability to debug, optimize, and automate code or workflows with minimal guidance.
  • Hands‑on experience building, deploying, and operating services in a public‑cloud or large on‑prem environment.