Senior Technical Program Manager – AI/ML & Data Infrastructure – Central Technology

Company: Chan Zuckerberg Initiative
Location: San Carlos, CA, USA
Salary: $178,000 – $267,000
Type: Full-Time
Experience Level: Senior, Expert or higher

Requirements

  • 7+ years of experience in technical program management or infrastructure-focused operations in complex engineering environments.
  • Proven ability to manage large-scale technical programs across multiple stakeholders and teams.
  • High-level understanding of machine learning workflows and model training pipelines, with the ability to translate infrastructure needs between research and engineering teams.
  • Strong organizational skills and experience leading cross-functional programs with tight timelines and multiple stakeholders.
  • Excellent written and verbal communication skills, including the ability to align stakeholders at multiple levels.
  • A passion for building efficient, secure, and inclusive systems to support cutting-edge science and research.

Responsibilities

  • Lead AI/ML infrastructure programs: Drive execution of technical initiatives across GPU scheduling, platform enablement, observability, or workload orchestration.
  • Lead access and lifecycle workflows: Own the end-to-end experience for users accessing shared infrastructure resources—including onboarding, offboarding, documentation, and support processes.
  • Coordinate infrastructure access requests: Manage intake and operational workflows for machine learning infrastructure access, including triage, tracking, and communication.
  • Drive documentation systems: Own the structure, accuracy, and governance of internal documentation, onboarding guides, runbooks, and infrastructure wikis.
  • Enhance visibility: Maintain and improve AI system dashboards and reporting systems for onboarding timelines, RFA volume, and infrastructure program milestones.

Preferred Qualifications

  • Familiarity with on-prem/HPC and/or multi-cloud GPU infrastructure, orchestration tools, and platforms such as Slurm, Run:AI, MLflow, W&B, or similar systems is a huge plus.