Senior Software Engineer – ML Training Platform
Company | DoorDash |
---|---|
Location | Seattle, WA, USA; San Francisco, CA, USA; Sunnyvale, CA, USA |
Salary | $130,600 – $285,000 |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- 6+ years of industry experience in software engineering, with a deep understanding of distributed systems and data-intensive ML pipelines in production.
- Hands-On ML Platform/Infra Experience – You’re familiar with modern machine learning stacks (e.g., PyTorch, LightGBM, TensorFlow) and have built or maintained large-scale training environments.
- Strong CS Fundamentals – You excel at crafting solutions that handle scale, complexity, and reliability challenges.
- Proven Project Ownership – You can break down complex initiatives, estimate accurately, and deliver major projects with minimal oversight.
- Collaboration & Communication – You’re adept at partnering across functions, setting expectations, and ensuring alignment among diverse stakeholders.
- Thrive on Continuous Improvement – You proactively identify gaps, reduce technical debt, and optimize resource usage, balancing cost and performance.
Responsibilities
- Drive Key Training Initiatives – Own and deliver significant sub-projects that enhance our platform’s performance, reliability, and ease of use.
- Architect & Implement Scalable Solutions – Design resilient pipelines for distributed model training (e.g., PyTorch, LightGBM) on Kubernetes, optimizing for both short-term speed and long-term maintainability.
- Collaborate with Cross-Functional Teams – Work with ML engineers, data scientists, and product stakeholders to refine requirements, set realistic milestones, and ensure smooth delivery.
- Set a High Bar for Quality & Reliability – Lead by example with clean, high-performance code, thorough design reviews, and a focus on observability, incident mitigation, and continuous improvement.
- Mentor & Influence – Help level up peers by sharing knowledge, driving best practices, and contributing to a supportive team culture that values empathy and technical excellence.
Preferred Qualifications
- GPU Acceleration – Experience with GPU-enabled training and its associated performance optimizations.
- MLOps Tooling – Familiarity with orchestration and tracking frameworks such as Metaflow, MLflow, Dagster, or Airflow.
- Large-Scale Data Processing – Knowledge of Spark, Hadoop, or other distributed data processing technologies.
- Monitoring & Observability – Proficiency with metrics and alerting solutions (e.g., Prometheus, Grafana).
- Cloud Platforms – Experience with AWS or GCP for scalable compute, container orchestration, and cost management.