Director of Production Engineering
Company | CoreWeave |
---|---|
Location | Livingston, NJ, USA; New York, NY, USA; Bellevue, WA, USA; Sunnyvale, CA, USA |
Salary | $230,000 – $275,000 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Expert or higher |
Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- 10+ years in engineering leadership roles within SRE, DevOps, or cloud infrastructure.
- 5+ years managing large-scale infrastructure-as-a-service in a geographically distributed, always-on environment.
- Proven success leading 24×7 operations teams and delivering high-availability services at scale.
- Deep expertise in automation, monitoring/observability, and incident response frameworks.
- Familiarity with cloud-native architectures purpose-built for AI, CI/CD systems, and performance tuning.
Responsibilities
- Define and execute the SRE vision, strategy, and roadmap for a large-scale, distributed cloud infrastructure.
- Lead and mentor a high-performing team of SREs, promoting a culture of ownership, collaboration, and continuous learning.
- Champion automation-first practices, leveraging Infrastructure-as-Code and tools like Terraform and Kubernetes to minimize toil and manual intervention.
- Establish and evolve best practices in observability, monitoring, and alerting, ensuring the platform is proactive, not reactive.
- Drive initiatives for incident management, postmortem culture, root cause analysis, and system hardening.
- Collaborate with engineering, product, and customer support teams to build scalable, resilient, and self-healing systems.
- Evolve our on-call strategy and processes to support a 24×7, globally distributed platform with minimal disruptions.
Preferred Qualifications
- Hands-on experience with Python, Go, Java, or Ruby for operational tooling and automation.
- Strong track record of hiring, mentoring, and developing top-tier SRE talent in high-growth companies.
- Comfortable navigating cross-functional dynamics and influencing leadership across engineering, product, and support.
- Experience leading DevOps and reliability transformation projects, improving developer velocity and platform resilience.