Software Engineer – Machine Learning Infrastructure
| Company | DatologyAI |
|---|---|
| Location | San Carlos, CA, USA |
| Salary | $180,000 – $250,000 |
| Type | Full-Time |
| Degrees | |
| Experience Level | Senior |
Requirements
- 5+ years of experience
- Meaningful experience leading and building production ML infrastructure and platforms that deliver on major product initiatives.
- Proficiency in Python and in the most commonly used infrastructure tools: Linux, Kubernetes, Terraform / Pulumi, etc.
- Strong knowledge of hardening cloud-native workloads, especially on Kubernetes.
- Experience maintaining a high-quality bar for design, correctness, and testing.
- Have a humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed.
- Own problems end-to-end and are willing to pick up whatever knowledge you’re missing to get the job done.
- Experience running data-processing workloads on Kubernetes (e.g., Spark on Kubernetes).
Responsibilities
- Architect, build, and maintain the infrastructure that keeps GPU training workloads highly available.
- Troubleshoot and resolve issues across GPU resources, networking, OS, drivers, and cloud environments, and automate the detection of and recovery from such issues.
- Design, build, and maintain the infrastructure that powers our data curation product.
- Partner with researchers and engineers to bring new features and research capabilities to our customers.
- Ensure that our infrastructure and systems are reliable, secure, and worthy of our customers’ trust.
Preferred Qualifications
No preferred qualifications provided.