Sr. Platform Engineer-GenAI
Field | Details |
---|---|
Company | KLA |
Location | Ann Arbor, MI, USA |
Salary | $108,100 – $183,800 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior, Expert or higher |

Requirements
- Bachelor’s Degree or equivalent training/certifications in Computer Science or related IT field
- Eight (8) years of experience implementing and maintaining AI/ML infrastructure in on-premises environments
- Strong experience with AI/ML infrastructure and tools, including GPU clusters and Kubernetes
- Proficiency in deploying and managing open-source GenAI components and vector databases
- Hands-on experience with high-performance computing (HPC) environments
- Expertise in designing and managing on-premises, cloud, and hybrid-based ML platforms
- Solid understanding of distributed storage systems, scheduling systems, and high availability capabilities
Responsibilities
- Identify and resolve infrastructure gaps to ensure reliable, efficient, and scalable solutions
- Develop advanced AI/ML infrastructure solutions that enhance the efficiency of our skilled ML teams
- Design and implement solutions for critical areas, including distributed storage systems, scheduling systems, high availability capabilities, and core reliability issues within our large-scale GPU clusters
- Monitor and optimize the performance of our AI/ML infrastructure, ensuring high availability, scalability, and efficient resource utilization
- Develop and deploy automation tools, monitoring solutions, and operational strategies to streamline infrastructure management and reduce manual tasks
- Work with various teams, including ML developers, data engineers, and DevOps professionals, to create a cohesive and integrated AI/ML infrastructure ecosystem
- Implement and manage GPU infrastructure within Kubernetes clusters to support high-performance computing and AI/ML tasks
- Deploy and manage open-source GenAI components, such as vector databases and various AI/ML models, ensuring seamless integration and optimal performance
- Evaluate and integrate new open-source GenAI tools and technologies to enhance the platform’s capabilities
- Collaborate with the research and development teams to implement and optimize innovative AI/ML models and algorithms
- Ensure the security and compliance of open-source GenAI components within the infrastructure
- Leverage High-Performance Computing (HPC) experience to optimize and manage large-scale AI/ML workloads
- Design, implement, and manage on-premises, cloud, and hybrid-based ML platforms to support diverse AI/ML workloads and ensure flexibility and scalability
Preferred Qualifications
No preferred qualifications provided.