Production Systems Engineer – Fleet AI Systems
Company | Meta |
---|---|
Location | Menlo Park, CA, USA; Bellevue, WA, USA |
Salary | $132,000 – $191,000 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior |
Requirements
- Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 6+ years of experience in hardware server system support, troubleshooting server architecture and components, analyzing, triaging, and solving systems level issues
- Expertise with Linux and scripting (Python or similar)
- 2+ years of experience in changing system configurations and measuring change impact, working through full lifecycle progressions of computer systems products
- 2+ years of experience engineering innovations in support of different server system/data center products
Responsibilities
- Work with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, then develop and execute test suites for various architectures
- Contribute as a leading member of the team, owning and proactively creating experiments and tooling to detect and diagnose hardware/firmware/software health issues, in organized and collaborative efforts
- Develop a test framework for large-scale test automation across the fleet, during product development and after mass production
- Implement remediations across software and hardware stack according to plan, while keeping a thorough procedure record and data log
- Develop and publish updates on resolutions and communicate findings internally
- Troubleshoot, diagnose, and root-cause system failures, isolating the components and failure scenarios while working with internal and external stakeholders
- Develop visibility through data visualization and implement systematic solutions to hardware health issues
- Drive necessary discussion with external and internal teams on test specification and methodologies to improve test quality continuously
- Develop robust, industry leading practices for supporting hardware infrastructure at scale
Preferred Qualifications
- Master’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 5+ years of experience in production support at scale (e.g., 10K storage servers and over 100K HDDs), working through full system technologies
- 2+ years of experience in post-production, hyperscale environments, delivering solutions to complex systems issues
- 3+ years of experience supporting AI or HPC systems and/or related systems, at scale
- 2+ years of experience working in a matrix organization, owning or driving initiatives as a leading contributor