Member of Technical Staff – ML Ops

Company: Captions
Location: New York, NY, USA
Salary: $160,000 – $250,000
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Mid Level, Senior

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, or related field
  • Strong programming skills in Python and experience with systems programming
  • Experience with distributed systems and scalable infrastructure
  • Track record of building reliable, performant large-scale ML systems
  • Deep expertise in PyTorch internals and distributed training frameworks (FSDP, DeepSpeed); a minimal FSDP sketch follows this list
  • GPU cluster management and optimization
  • Performance profiling and systems optimization
  • CUDA programming and kernel optimization
  • Containerization and orchestration (Docker, Kubernetes)
  • ML model serving and deployment at scale
  • Language models and attention mechanism optimization
  • Video and audio processing pipelines
  • Large-scale diffusion models
  • Love diving deep into complex systems optimization challenges
  • Take ownership of critical infrastructure while collaborating effectively
  • Get excited about pushing the boundaries of ML system performance
  • Want to work directly with researchers on cutting-edge ML problems
  • Thrive in fast-paced, research-driven environments
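
As a minimal sketch of the FSDP experience mentioned above, assuming a process group already initialized by torchrun; the `shard_model` helper and `block_cls` placeholder are illustrative, not part of this posting:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

def shard_model(model: torch.nn.Module, block_cls: type) -> FSDP:
    # Assumes the default process group was initialized (e.g. by torchrun).
    assert dist.is_initialized()
    return FSDP(
        model,
        auto_wrap_policy=ModuleWrapPolicy({block_cls}),  # shard at transformer-block granularity
        device_id=torch.cuda.current_device(),           # place shards on the local GPU
    )
```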

Responsibilities

  • Develop and optimize distributed training frameworks integrating multiple modalities (video, audio, text, and structured metadata)
  • Build flexible systems for cross-modal training orchestration and efficient experimentation
  • Design reproducible training environments with versioned dependencies and configurations
  • Implement comprehensive testing frameworks for validating model training correctness and performance
  • Create infrastructure for systematic model quality assessment and performance benchmarking
  • Design and implement flexible training orchestration systems that balance research agility with large-scale model training
  • Build robust monitoring and observability systems for complex training and inference pipelines
  • Design and manage GPU clusters optimized for distributed training of multimodal models
  • Build out comprehensive automated metrics collection and alerting across our ML stack
  • Profile and optimize model training throughput using mixed precision, gradient checkpointing, and advanced memory techniques (see the sketch after this list)
  • Develop custom CUDA and Triton kernels to accelerate critical compute paths (a Triton example also follows this list)
  • Implement creative solutions for cost optimization across spot instances and reserved capacity
  • Design and optimize real-time inference systems enabling fast research iteration cycles
  • Build infrastructure enabling rapid testing of research hypotheses
  • Create systems supporting close collaboration between infrastructure and research teams
  • Develop frameworks for reproducible research experimentation
  • Enable seamless deployment of research innovations to production
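
A minimal sketch of the throughput techniques named above (bf16 autocast plus activation checkpointing); the `model.backbone` / `model.head` attributes and batch keys are assumptions for illustration, not part of this posting:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def train_step(model, batch, optimizer):
    optimizer.zero_grad(set_to_none=True)
    # bf16 autocast runs matmuls in reduced precision without needing a GradScaler.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        # Recompute backbone activations during backward to trade compute for memory.
        hidden = checkpoint(model.backbone, batch["inputs"], use_reentrant=False)
        loss = F.cross_entropy(model.head(hidden), batch["targets"])
    loss.backward()
    optimizer.step()
    return loss.detach()
```

A minimal Triton kernel sketch (an element-wise scale) of the kind of custom-kernel work described above; the function names and block size are assumptions:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scale_kernel(x_ptr, out_ptr, alpha, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements            # guard the ragged final block
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * alpha, mask=mask)

def scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)      # one program per 1024-element block
    scale_kernel[grid](x, out, alpha, n, BLOCK=1024)
    return out
```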
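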

Preferred Qualifications

  • Strong experience in some or all of the areas listed under Requirements