Skip to content

Software Engineer – GPU Infrastructure – HPC
Company | OpenAI |
---|
Location | San Francisco, CA, USA |
---|
Salary | $325000 – $590000 |
---|
Type | Full-Time |
---|
Degrees | |
---|
Experience Level | Junior, Mid Level |
---|
Requirements
- Experience managing large-scale server environments.
- Proficiency in Python, Go, or similar languages.
- Strong Linux, networking, and server hardware knowledge.
- Comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool.
Responsibilities
- Build and maintain automation systems for provisioning and managing server fleets.
- Develop tools to monitor server health, performance, and lifecycle events.
- Collaborate with clusters, networking, and infrastructure teams.
- Partner with external operators to ensure a high level of quality.
- Identify and fix performance bottlenecks and inefficiencies.
- Continuously improve automation to reduce manual work.
Preferred Qualifications
- Experience with low level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, Infiniband, networking, power management, kernel perf tuning)
- Knowledge of hardware management protocols (e.g., IPMI, Redfish).
- High-performance computing (HPC) or distributed systems experience.
- Prior experience developing, managing, or designing hardware.
- Familiarity with monitoring tools (e.g., Prometheus, Grafana).