
Member of Technical Staff – ML Ops
| Company | Captions |
| --- | --- |
| Location | New York, NY, USA |
| Salary | $160,000 – $250,000 |
| Type | Full-Time |
| Degrees | Bachelor’s, Master’s |
| Experience Level | Mid Level, Senior |
Requirements
- Bachelor’s or Master’s degree in Computer Science, Machine Learning, or related field
- Strong Python programming skills and systems programming experience
- Experience with distributed systems and scalable infrastructure
- Track record of building reliable, performant large-scale ML systems
- Deep expertise in PyTorch internals and distributed training frameworks (FSDP, DeepSpeed)
- GPU cluster management and optimization
- Performance profiling and systems optimization
- CUDA programming and kernel optimization
- Containerization and orchestration (Docker, Kubernetes)
- ML model serving and deployment at scale
- Language models and attention mechanism optimization
- Video and audio processing pipelines
- Large-scale diffusion models
- Love diving deep into complex systems optimization challenges
- Take ownership of critical infrastructure while collaborating effectively
- Get excited about pushing the boundaries of ML system performance
- Want to work directly with researchers on cutting-edge ML problems
- Thrive in fast-paced, research-driven environments
Responsibilities
- Develop and optimize distributed training frameworks integrating multiple modalities (video, audio, text, and structured metadata)
- Build flexible systems for cross-modal training orchestration and efficient experimentation
- Design reproducible training environments with versioned dependencies and configurations
- Implement comprehensive testing frameworks for validating model training correctness and performance
- Create infrastructure for systematic model quality assessment and performance benchmarking
- Design and implement flexible training orchestration systems that balance research agility with large-scale model training
- Build robust monitoring and observability systems for complex training and inference pipelines
- Design and manage GPU clusters optimized for distributed training of multimodal models
- Build out comprehensive automated metrics collection and alerting across our ML stack
- Profile and optimize model training throughput using mixed precision, gradient checkpointing, and advanced memory techniques
- Develop custom CUDA and Triton kernels to accelerate critical compute paths
- Implement creative solutions for cost optimization across spot instances and reserved capacity
- Design and optimize real-time inference systems enabling fast research iteration cycles
- Build infrastructure enabling rapid testing of research hypotheses
- Create systems supporting close collaboration between infrastructure and research teams
- Develop frameworks for reproducible research experimentation
- Enable seamless deployment of research innovations to production
Preferred Qualifications
- Strong experience in some or all of the technical areas listed under Requirements