Production Systems Engineer – Fleet AI Systems

Company: Meta
Location: Menlo Park, CA, USA; Bellevue, WA, USA
Salary: $132,000 – $191,000
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of experience in hardware server system support, including troubleshooting server architecture and components and analyzing, triaging, and solving system-level issues
  • Expertise with Linux and scripting (Python or similar)
  • 2+ years of experience in changing system configurations and measuring change impact, working through full lifecycle progressions of computer systems products
  • 2+ years of experience engineering innovations in support of different server system/data center products

Responsibilities

  • Work with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, then develop and execute test suites for various architectures
  • Contribute as a leading member of the team, owning and proactively creating experiments and tooling to detect and diagnose hardware/firmware/software health issues, in organized and collaborative efforts
  • Develop a test framework for large-scale test automation within the fleet, both during product development and after mass production
  • Implement remediations across software and hardware stack according to plan, while keeping a thorough procedure record and data log
  • Develop and publish updates on resolutions and communicate findings internally
  • Troubleshoot, diagnose, and root-cause system failures, and isolate the failing components and failure scenarios while working with internal and external stakeholders
  • Develop visibility through data visualization and implement systematic solutions to hardware health issues
  • Drive necessary discussion with external and internal teams on test specification and methodologies to improve test quality continuously
  • Develop robust, industry-leading practices for supporting hardware infrastructure at scale

Preferred Qualifications

  • Master’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 5+ years of experience in production support at scale (e.g., 10K storage servers and over 100K HDDs), working through full system technologies
  • 2+ years of experience in post-production, hyperscale environments, delivering solutions to complex systems issues
  • 3+ years of experience supporting AI or HPC systems and/or related systems, at scale
  • 2+ years of experience working in a matrix organization, owning or driving initiatives as a leading contributor