Software Engineer – Machine Learning Infrastructure

Company: DatologyAI
Location: San Carlos, CA, USA
Salary: $180,000 – $250,000
Type: Full-Time
Experience Level: Senior

Requirements

  • 5+ years of experience
  • Meaningful experience leading and building production ML infrastructure and platforms that deliver on major product initiatives.
  • Proficiency in Python and in the most commonly used infrastructure tools: Linux, Kubernetes, Terraform / Pulumi, etc.
  • Strong knowledge of hardening cloud-native workloads, especially Kubernetes (K8s) workloads.
  • Experience maintaining a high-quality bar for design, correctness, and testing.
  • A humble attitude, an eagerness to help your colleagues, and a desire to do whatever it takes to make the team succeed.
  • Ownership of problems end to end and a willingness to pick up whatever knowledge you’re missing to get the job done.
  • Experience running data-processing workloads on Kubernetes (e.g., Spark on K8s).

Responsibilities

  • Architect, build, and maintain the infrastructure that keeps GPU training workloads highly available.
  • Troubleshoot and resolve issues across GPU resources, networking, OS, drivers, and cloud environments, and automate the detection of and recovery from such issues.
  • Design, build, and maintain the infrastructure that powers our data curation product.
  • Partner with researchers and engineers to bring new features and research capabilities to our customers.
  • Ensure that our infrastructure and systems are reliable, secure, and worthy of our customers’ trust.

Preferred Qualifications

    No preferred qualifications provided.