
Member of Technical Staff – ML Ops
| Company | Captions |
| --- | --- |
| Location | New York, NY, USA |
| Salary | $160,000 – $250,000 |
| Type | Full-Time |
| Degrees | Bachelor’s, Master’s |
| Experience Level | Mid Level, Senior |
Requirements
- Bachelor’s or Master’s degree in Computer Science, Machine Learning, or related field
- Strong Python programming skills and systems programming experience
- Experience with distributed systems and scalable infrastructure
- Track record of building reliable, performant large-scale ML systems
- Deep expertise in PyTorch internals and distributed training frameworks (FSDP, DeepSpeed)
- GPU cluster management and optimization
- Performance profiling and systems optimization
- CUDA programming and kernel optimization
- Containerization and orchestration (Docker, Kubernetes)
- ML model serving and deployment at scale
- Language models and attention mechanism optimization
- Video and audio processing pipelines
- Large-scale diffusion models
- Love diving deep into complex systems optimization challenges
- Take ownership of critical infrastructure while collaborating effectively
- Get excited about pushing the boundaries of ML system performance
- Want to work directly with researchers on cutting-edge ML problems
- Thrive in fast-paced, research-driven environments
Responsibilities
- Develop and optimize distributed training frameworks integrating multiple modalities (video, audio, text, and structured metadata)
- Build flexible systems for cross-modal training orchestration and efficient experimentation
- Design reproducible training environments with versioned dependencies and configurations
- Implement comprehensive testing frameworks for validating model training correctness and performance
- Create infrastructure for systematic model quality assessment and performance benchmarking
- Design and implement flexible training orchestration systems that balance research agility with large-scale model training
- Build robust monitoring and observability systems for complex training and inference pipelines
- Design and manage GPU clusters optimized for distributed training of multimodal models
- Build out comprehensive automated metrics collection and alerting across our ML stack
- Profile and optimize model training throughput using mixed precision, gradient checkpointing, and advanced memory techniques
- Develop custom CUDA and Triton kernels to accelerate critical compute paths
- Implement creative solutions for cost optimization across spot instances and reserved capacity
- Design and optimize real-time inference systems enabling fast research iteration cycles
- Build infrastructure enabling rapid testing of research hypotheses
- Create systems supporting close collaboration between infrastructure and research teams
- Develop frameworks for reproducible research experimentation
- Enable seamless deployment of research innovations to production
Preferred Qualifications
- Strong experience in some or all of the technical areas listed under Requirements