Staff Machine Learning Engineer
| Company | Tempus |
| --- | --- |
| Location | San Francisco, CA, USA; Remote in USA; Chicago, IL, USA; New York, NY, USA |
| Salary | $170,000 – $230,000 |
| Type | Full-Time |
| Degrees | Master’s, PhD |
| Experience Level | Senior, Expert or higher |
Requirements
- Master’s degree in Computer Science, Artificial Intelligence, Software Engineering, or a related field, with a strong academic background focused on AI data engineering.
- Proven track record (8+ years of industry experience) in designing, building, and operating large-scale data pipelines and infrastructure in a production environment.
- Strong experience working with massive, heterogeneous datasets (TBs+) and modern distributed data processing tools and frameworks such as Apache Spark, Ray, or Dask.
- Strong, hands-on experience with tools and libraries specifically designed for large-scale ML data handling, such as Hugging Face Datasets, MosaicML Streaming, or similar frameworks (e.g., WebDataset, Petastorm). Experience with MLOps tools and platforms (e.g., MLflow, Kubeflow, SageMaker Pipelines).
- Understanding of the data challenges specific to training large models (Foundation Models, LLMs, Multimodal Models).
- Proficiency in programming languages like Python and experience with modern distributed data processing tools and frameworks.
- Proven ability to bring thought leadership to the product and engineering teams, influencing technical direction and data strategy.
- Experience mentoring junior engineers and collaborating effectively with cross-functional teams (Research Scientists, ML Engineers, Platform Engineers, Product Managers, Clinicians).
- Excellent communication skills, capable of explaining complex technical concepts to diverse audiences.
- Strong bias-to-action and ability to thrive in a fast-paced, dynamic research and development environment.
- A pragmatic approach focused on delivering rapid, iterative, and measurable progress towards impactful goals.
Responsibilities
- Architect and build sophisticated data processing workflows that ingest, process, and prepare multimodal training data, integrating seamlessly with large-scale distributed ML training frameworks and infrastructure (GPU clusters).
- Develop strategies for efficient, compliant data ingestion from diverse sources, including internal databases, third-party APIs, public biomedical datasets, and Tempus’s proprietary data ecosystem.
- Utilize, optimize, and contribute to frameworks specialized for large-scale ML data loading and streaming (e.g., MosaicML Streaming, Ray Data, HF Datasets).
- Collaborate closely with infrastructure and platform teams to leverage and optimize cloud-native services (primarily GCP) for performance, cost-efficiency, and security.
- Engineer efficient connectors and data loaders for accessing and processing information from diverse knowledge sources, such as knowledge graphs, internal structured databases, biomedical literature repositories (e.g., PubMed), and curated ontologies.
- Optimize data storage for efficient large-scale training and knowledge access.
- Orchestrate, monitor, and troubleshoot complex data workflows using tools such as Airflow and Kubeflow Pipelines.
- Establish robust monitoring, logging, and alerting systems for data pipeline health, data drift detection, and data quality assurance, providing feedback loops for continuous improvement.
- Analyze and optimize data I/O performance bottlenecks across storage systems, network bandwidth, and compute resources.
- Actively manage and seek optimizations for the costs associated with storing and processing massive datasets in the cloud.
Preferred Qualifications
- Advanced degree (PhD) in Computer Science, Engineering, Bioinformatics, or a related field.
- Contributions to relevant open-source projects.
- Direct experience working with clinical or biological data (EHR, genomics, medical imaging).