Hardware Systems Engineer – AI Systems

Company: Meta
Location: Menlo Park, CA, USA
Salary: $132,000 – $191,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • 5+ years of work experience in one or more domains such as: ASIC development (silicon design, bring-up, or characterization), compute (ARM, x86), AI/ML hardware/software (GPUs, TPUs), storage (SSD/HDD), memory (DRAM), network (NIC), or server interconnect technologies (PCIe, etc.).
  • Knowledge of the architecture and components of one of the following products: server, PC, or laptop.
  • Development or debug experience in one or more of the following areas: hardware fault management, error reporting, or error handling on hardware products.
  • Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience.

Responsibilities

  • Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, and use that understanding to guide and develop hardware fault management for various server products.
  • Drive new platform enablement for the Meta fleet (hardware validation, tooling specification and integration, customer workload testing).
  • Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues.
  • Leverage a demonstrated understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operational quality and cost efficiency.
  • Drive engineering and operational rigor by establishing metrics and processes for regular assessment and improvement.
  • Develop visibility through data visualization and implement systemic solutions to hardware health issues.
  • Troubleshoot, diagnose, and root-cause system failures, isolating the components and failure scenarios while working with internal and external stakeholders.

Preferred Qualifications

  • 7+ years of experience with a subset of the following AI systems areas: accelerators (GPU/ASIC), performance characterization/optimization/tracing/debugging (e.g., on NVIDIA, AMD, Intel, or other accelerators), computer architecture, HPC communication libraries (e.g., NCCL, MPI), GPU interconnect technologies (NVLink, XGMI, etc.).
  • Experience with the architecture of disaggregated systems at scale.
  • NPI (new product introduction) experience for at-scale deployment.
  • Experience troubleshooting problems at the system level, spanning multiple components as well as hardware/firmware/software boundaries.