Senior Software Engineer – ML Training Platform

Company: DoorDash
Location: Seattle, WA, USA; San Francisco, CA, USA; Sunnyvale, CA, USA
Salary: $130,600 – $285,000
Type: Full-Time
Experience Level: Senior

Requirements

  • 6+ years of industry experience in software engineering, with a deep understanding of distributed systems and data-intensive ML pipelines in production.
  • Hands-On ML Platform/Infra Experience – You’re familiar with modern machine learning stacks (e.g., PyTorch, LightGBM, TensorFlow) and have built or maintained large-scale training environments.
  • Strong CS Fundamentals – You excel at crafting solutions that handle scale, complexity, and reliability challenges.
  • Proven Project Ownership – You can break down complex initiatives, estimate accurately, and deliver major projects with minimal oversight.
  • Collaboration & Communication – You’re adept at partnering across functions, setting expectations, and ensuring alignment among diverse stakeholders.
  • Thrive on Continuous Improvement – You proactively identify gaps, reduce technical debt, and optimize resource usage, balancing cost and performance.

Responsibilities

  • Drive Key Training Initiatives – Own and deliver significant sub-projects that enhance our platform’s performance, reliability, and ease of use.
  • Architect & Implement Scalable Solutions – Design resilient pipelines for distributed model training (e.g., PyTorch, LightGBM) on Kubernetes, optimizing for both short-term speed and long-term maintainability.
  • Collaborate with Cross-Functional Teams – Work with ML engineers, data scientists, and product stakeholders to refine requirements, set realistic milestones, and ensure smooth delivery.
  • Set a High Bar for Quality & Reliability – Lead by example with clean, high-performance code, thorough design reviews, and a focus on observability, incident mitigation, and continuous improvement.
  • Mentor & Influence – Help level up peers by sharing knowledge, driving best practices, and contributing to a supportive team culture that values empathy and technical excellence.

Preferred Qualifications

  • GPU Acceleration – Experience with GPU-enabled training and its associated performance optimizations.
  • MLOps Tooling – Familiarity with orchestration and tracking frameworks such as Metaflow, MLflow, Dagster, or Airflow.
  • Large-Scale Data Processing – Knowledge of Spark, Hadoop, or other distributed data processing technologies.
  • Monitoring & Observability – Proficiency with metrics and alerting solutions (e.g., Prometheus, Grafana).
  • Cloud Platforms – Experience with AWS or GCP for scalable compute, container orchestration, and cost management.