Member of Technical Staff – Training Data Infrastructure

Company: Captions
Location: New York, NY, USA
Salary: $170,000 – $250,000
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Mid Level, Senior

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Machine Learning, or related field.
  • 3+ years of experience in ML infrastructure development or large-scale data engineering.
  • Strong programming skills, particularly in Python and distributed computing frameworks.
  • Expertise in building and optimizing high-throughput data pipelines.
  • Proven experience with video/image data pre-processing and feature engineering.
  • Deep knowledge of machine learning workflows, including model training and data loading systems.
  • Track record in performance optimization and system scaling.
  • Experience with cluster management and distributed computing.
  • Background in MLOps and infrastructure monitoring.
  • Demonstrated ability to build reliable, large-scale data processing systems.

Responsibilities

  • Build performant pipelines for processing video and multimodal training data at scale.
  • Design distributed systems that scale seamlessly with our rapidly growing video and multimodal datasets.
  • Create efficient data loading systems optimized for GPU training throughput.
  • Implement comprehensive telemetry for video processing and training pipelines.
  • Create foundational data processing systems that intelligently cache and reuse expensive computations across the training pipeline.
  • Build robust data validation and quality measurement systems for video and multimodal content.
  • Design systems for data versioning and reproducing complex multimodal training runs.
  • Develop efficient storage and compute patterns for high-dimensional data and learned representations.
  • Own and improve end-to-end training pipeline performance.
  • Build systems for efficient storage and retrieval of video training data.
  • Build frameworks for systematic data and model quality improvement.
  • Develop infrastructure supporting fast research iteration cycles.
  • Build tools and systems for deep understanding of our training data characteristics.
  • Build infrastructure enabling rapid testing of research hypotheses.
  • Create systems for incorporating user feedback into training workflows.
  • Design measurement frameworks that connect model improvements to user outcomes.
  • Enable systematic experimentation with direct user feedback loops.
