Production Systems Engineer - Fleet AI Systems

Company	Meta
Location	Menlo Park, CA, USA; Bellevue, WA, USA
Salary	$132,000 – $191,000
Type	Full-Time
Degrees	Bachelor’s, Master’s
Experience Level	Senior

Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
6+ years of experience in hardware server system support, troubleshooting server architecture and components, analyzing, triaging, and solving systems level issues
Expertise with Linux and scripting (Python or similar)
2+ years of experience in changing system configurations and measuring change impact, working through full lifecycle progressions of computer systems products
2+ years of experience engineering innovations in support of different server system/data center products

Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, then develop and execute test suites for various architectures
Contribute as a leading member of the team, owning and proactively creating experiments and tooling to detect and diagnose hardware/firmware/software health issues, in organized and collaborative efforts
Develop a test framework for large-scale test automation within the fleet, both during product development and after mass production
Implement remediations across the software and hardware stack according to plan, while keeping a thorough procedure record and data log
Develop and publish updates on resolutions and communicate findings internally
Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external stakeholders
Develop visibility through data visualization and implement systematic solutions to hardware health issues
Drive discussions with external and internal teams on test specifications and methodologies to continuously improve test quality
Develop robust, industry-leading practices for supporting hardware infrastructure at scale

Master’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
5+ years of experience in production support at scale (e.g., 10K storage servers and over 100K HDDs) working through full system technologies
2+ years of experience in post-production, hyperscale environments, delivering solutions to complex systems issues
3+ years of experience supporting AI, HPC, or related systems at scale
2+ years of experience working in a matrix organization, owning or driving initiatives as a leading contributor