Hardware Systems Engineer – AI Systems

Company: Meta
Location: Austin, TX, USA; Menlo Park, CA, USA
Salary: $132,000 – $191,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • 6+ years of work experience in one or more domains such as: ASIC development (silicon design, bring-up, or characterization), compute (ARM, x86), AI/ML hardware/software (GPUs, TPUs), storage (SSD/HDD), memory (DRAM), network (NIC), or server interconnect technologies (PCIe, etc.)
  • Knowledge of the architecture and components of one of the following products: server, PC, or laptop
  • Development or debug experience in one or more of the following areas: hardware fault management, error reporting, or error handling on hardware products

Responsibilities

  • Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture and to guide and develop Hardware Fault Management for various server products
  • Drive new platform enablement for the Meta fleet (hardware validation, tooling specification and integration, customer workload testing)
  • Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues
  • Leverage a demonstrated understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operational quality and cost efficiency
  • Drive engineering and operational rigor by establishing metrics and process for regular assessment and improvement
  • Enhance understanding through data visualization and implement systemic solutions to hardware health issues
  • Troubleshoot, diagnose, and root-cause system failures, isolating the components and failure scenarios while working with internal and external stakeholders

Preferred Qualifications

  • 7+ years of experience with a subset of the following AI systems areas: accelerators (GPU/ASIC), performance characterization/optimization/tracing/debugging (e.g., on NVIDIA, AMD, Intel, or other accelerators), computer architecture, HPC communication libraries (e.g., NCCL, MPI), GPU interconnect technologies (NVLink, XGMI, etc.)
  • Experience with architecture of disaggregated systems at scale
  • New Product Introduction (NPI) experience for at-scale deployment
  • Experience troubleshooting system-level problems that span multiple components and cross hardware/firmware/software boundaries