Senior Software Engineer - ML Training Platform

Company	DoorDash
Location	Seattle, WA, USA; San Francisco, CA, USA; Sunnyvale, CA, USA
Salary	$130,600 – $285,000
Type	Full-Time
Experience Level	Senior

6+ years of industry experience in software engineering, with a deep understanding of distributed systems and data-intensive ML pipelines in production.
Hands-On ML Platform/Infra Experience – You’re familiar with modern machine learning stacks (e.g., PyTorch, LightGBM, TensorFlow) and have built or maintained large-scale training environments.
Strong CS fundamentals – You excel at crafting solutions that handle scale, complexity, and reliability challenges.
Proven Project Ownership – You can break down complex initiatives, estimate accurately, and deliver major projects with minimal oversight.
Collaboration & Communication – You’re adept at partnering across functions, setting expectations, and ensuring alignment among diverse stakeholders.
Thrive on Continuous Improvement – You proactively identify gaps, reduce technical debt, and optimize resource usage, balancing cost and performance.

Drive Key Training Initiatives – Own and deliver significant sub-projects that enhance our platform’s performance, reliability, and ease of use.
Architect & Implement Scalable Solutions – Design resilient pipelines for distributed model training (e.g., PyTorch, LightGBM) on Kubernetes, optimizing for both short-term speed and long-term maintainability.
Collaborate with Cross-Functional Teams – Work with ML engineers, data scientists, and product stakeholders to refine requirements, set realistic milestones, and ensure smooth delivery.
Set a High Bar for Quality & Reliability – Lead by example with clean, high-performance code, thorough design reviews, and a focus on observability, incident mitigation, and continuous improvement.
Mentor & Influence – Help level up peers by sharing knowledge, driving best practices, and contributing to a supportive team culture that values empathy and technical excellence.

GPU Acceleration – Experience with GPU-enabled training and its associated performance optimizations.
MLOps Tooling – Familiarity with orchestration and tracking frameworks such as Metaflow, MLflow, Dagster, or Airflow.
Large-Scale Data Processing – Knowledge of Spark, Hadoop, or other distributed data processing technologies.
Monitoring & Observability – Proficiency with metrics and alerting solutions (e.g., Prometheus, Grafana).
Cloud Platforms – Experience with AWS or GCP for scalable compute, container orchestration, and cost management.