Production Systems Engineer
Company | Meta |
---|---|
Location | Menlo Park, CA, USA |
Salary | $132,000 – $191,000 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s, PhD |
Experience Level | Senior |
Requirements
- Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 6+ years of hands-on software, firmware, or hardware engineering experience building systems or products for the IT industry
- Experience with troubleshooting and data tooling, including data analysis, analytical modeling, and visualization
- Knowledge of server architecture and components across Compute/Storage/AI Systems/Networking
Responsibilities
- Drive innovation in hardware efficiency by applying expertise in hardware utilization and performance, and translating insights into actionable strategies for hardware, power, performance and data center optimization
- Contribute to industry-leading research in hardware characterization and fleet/DC efficiency studies across AI platforms, leveraging data-driven and machine learning analytical techniques
- Conduct in-depth research and comparative analyses of hardware parameters, using advanced data analytics and machine learning techniques for failure analysis and diagnosis in production
- Interface with internal hardware engineers, software engineers, and operations teams to understand system architectures and failure modes
- Proactively create experiments, data analyses, and data visualizations to detect and diagnose hardware health issues, focusing on systemic solutions
- Collaborate on evolving AI platforms, silicon products, thermal and cooling solutions to support the growth of large language models, with a focus on optimizing performance, scalability, and efficiency
- Develop data frameworks and uncover insights that explain the relationships among hardware, data center parameters, and server failures
- Develop and implement data-driven strategies using hardware characterization studies to support hardware fleet optimizations and efficiency while supporting improvements to future platform designs
- Build comprehensive monitoring and predictive frameworks that make data insights available to partner teams for decision making
- Collaborate with cross-functional teams to ingest and present data on evolving domains and specialized hardware technologies, components, and data centers
- Share insights with stakeholders and software teams to develop architectures to handle server failures based on hardware health data
- Troubleshoot, diagnose, and root-cause system failures, isolating the failing components or failure scenarios through in-depth statistical studies while working with internal and external stakeholders
Preferred Qualifications
- Master’s degree or PhD in Computer Engineering, Electrical Engineering, or related field
- Experience integrating lab tools into automated workflows
- Proficiency in SQL, Python, or C/C++ (data structures, algorithms, and OOP)
- Experience with Linux systems and server systems management
- Experience with some of the following modules/domains: PCIe, Networking, Flash, Memory, CPU, GPU, DRAM (DDR4/5 or HBM)