Senior Machine Learning Infrastructure Engineer
| Company | PlusAI |
|---|---|
| Location | Santa Clara, CA, USA |
| Salary | $160,000 – $200,000 |
| Type | Full-Time |
| Degrees | Master’s, PhD |
| Experience Level | Senior |
Requirements
- PhD or MS in Computer Science, Electrical Engineering, or a related field
- Good oral and written communication skills
- New PhD graduate, or Master’s with 3+ years of software engineering experience focused on ML infrastructure or distributed systems
- Proficiency in Python, C++, and SQL
- Deep understanding of containerization, orchestration technologies, distributed ML workloads, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and MLflow)
- Experience deploying and managing resources across multiple cloud platforms (AWS, GCP, or on-prem environments)
- Proficiency in at least one deep learning framework, such as PyTorch, and in data pipeline tools (e.g., Apache Airflow, Prefect)
- Strong knowledge of distributed systems, databases, and storage solutions
- Extensive software design and development skills
- Ability to learn and adapt to new technologies and contribute in a productive environment
Responsibilities
- Design and develop scalable, high-performance systems for training, deploying, serving, and monitoring ML models
- Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks
- Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability
- Implement distributed systems and storage solutions optimized for machine learning workloads
- Drive improvements in CI/CD workflows for ML models and infrastructure
- Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems
- Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform
- Mentor junior engineers and contribute to a culture of technical excellence
Preferred Qualifications
- Familiarity with fundamental deep learning architectures, such as Convolutional Neural Networks (CNNs) and Transformer models
- Experience in building large-scale ML datasets, MLOps pipelines, and distributed computing frameworks like Ray
- Experience working with autonomous vehicles or robotics
- Ensure that your work complies with the company’s Quality Management System (QMS) requirements, and contribute to continuous improvement efforts
- Ensure that technical work meets customer requirements, regulatory standards, and company quality policies