Senior Technical Program Manager – AI/ML & Data Infrastructure – Central Technology
Company | Chan Zuckerberg Initiative |
---|---|
Location | San Carlos, CA, USA |
Salary | $178,000 – $267,000 |
Type | Full-Time |
Degrees | |
Experience Level | Senior, Expert or higher |
Requirements
- 7+ years of experience in technical program management or infrastructure-focused operations in complex engineering environments.
- Proven ability to manage large-scale technical programs across multiple stakeholders and teams.
- High-level understanding of machine learning workflows and model training pipelines, with the ability to translate infrastructure needs between research and engineering teams.
- Strong organizational skills and experience leading cross-functional programs with tight timelines and multiple stakeholders.
- Excellent written and verbal communication skills, including the ability to align stakeholders at multiple levels.
- A passion for building efficient, secure, and inclusive systems to support cutting-edge science and research.
Responsibilities
- Lead AI/ML infrastructure programs: Drive execution of technical initiatives across GPU scheduling, platform enablement, observability, or workload orchestration.
- Lead access and lifecycle workflows: Own the end-to-end experience for users accessing shared infrastructure resources, including onboarding, offboarding, documentation, and support processes.
- Coordinate infrastructure access requests: Manage intake and operational workflows for machine learning infrastructure access, including triage, tracking, and communication.
- Drive documentation systems: Own the structure, accuracy, and governance of internal documentation, onboarding guides, runbooks, and infrastructure wikis.
- Enhance visibility: Maintain and improve AI system dashboards and reporting systems for onboarding timelines, RFA volume, and infrastructure program milestones.
Preferred Qualifications
- Familiarity with on-prem/HPC and/or multi-cloud GPU infrastructure, orchestration tools, and platforms such as Slurm, Run:AI, MLflow, W&B, or similar systems.