Production Systems Engineer

Company: Meta
Location: Menlo Park, CA, USA
Salary: $132,000 – $191,000
Type: Full-Time
Degrees: Bachelor’s, Master’s, PhD
Experience Level: Senior

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of hands-on software, firmware, or hardware engineering experience building systems or products for the IT industry
  • Experience with troubleshooting and data tooling, including data analysis, analytical modeling, and visualization
  • Knowledge of server architecture and components across Compute/Storage/AI Systems/Networking

Responsibilities

  • Drive innovation in hardware efficiency by applying expertise in hardware utilization and performance, and translating insights into actionable strategies for hardware, power, performance and data center optimization
  • Contribute to industry-leading research in hardware characterization and fleet/DC efficiency studies across AI platforms, leveraging data-driven and machine learning analytical techniques
  • Conduct in-depth research and comparative analyses of hardware parameters, using advanced data analytics and machine learning techniques for failure analysis and diagnosis in production
  • Interface with internal hardware engineers, software engineers, and operations teams to understand system architectures and failure modes
  • Proactively design experiments, data analyses, and data visualizations to detect and diagnose hardware health issues, focusing on systemic solutions
  • Collaborate on evolving AI platforms, silicon products, thermal and cooling solutions to support the growth of large language models, with a focus on optimizing performance, scalability, and efficiency
  • Develop data frameworks and uncover insights that explain the relationships between hardware, data center parameters, and server failures
  • Develop and implement data-driven strategies using hardware characterization studies to support hardware fleet optimizations and efficiency while supporting improvements to future platform designs
  • Build comprehensive monitoring and predictive frameworks, and make data insights available to partner teams for decision making
  • Collaborate with cross-functional teams to ingest and present data on evolving domains and specialized hardware technologies, components, and data centers
  • Share insights with stakeholders and software teams to develop architectures to handle server failures based on hardware health data
  • Troubleshoot, diagnose, and root-cause system failures, and isolate the components or failure scenarios through in-depth statistical studies, while working with internal and external stakeholders

Preferred Qualifications

  • Master’s degree or PhD in Computer Engineering, Electrical Engineering, or related field
  • Experience integrating lab tools into automated workflows
  • Proficiency in SQL, Python, or C/C++ (data structures, algorithms, and OOP)
  • Experience with Linux systems and server systems management
  • Experience with some of the following modules/domains: PCIe, Networking, Flash, Memory, CPU, GPU, DRAM (DDR4/5 or HBM)