Hardware Systems Engineer - AI Systems

5+ years of work experience in one or more domains such as: ASIC development (silicon design, bring-up, or characterization), compute (ARM, x86), AI/ML hardware/software (GPUs, TPUs), storage (SSD/HDD), memory (DRAM), network (NIC), and server interconnect technologies (e.g., PCIe).
Knowledge of the architecture and components of at least one of the following products: server, PC, or laptop.
Development or debug experience in one or more of the following areas: hardware fault management, error reporting, or error handling on hardware products.
Bachelor’s degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience.

Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, and use that understanding to guide and develop Hardware Fault Management for various server products.
Drive new platform enablement for the Meta fleet (hardware validation, tooling specification and integration, customer workload testing).
Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues.
Leverage a demonstrated understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operational quality and cost efficiency.
Drive engineering and operational rigor by establishing metrics and processes for regular assessment and improvement.
Develop visibility through data visualization and implement systemic solutions to hardware health issues.
Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external stakeholders.

7+ years of experience with a subset of the following AI systems areas: accelerators (GPU/ASIC), performance characterization, optimization, tracing, and debugging (e.g., on NVIDIA, AMD, Intel, or other accelerators), computer architecture, HPC communication libraries (e.g., NCCL, MPI), and GPU interconnect technologies (e.g., NVLink, XGMI).
Experience with architecture of disaggregated systems at scale.
NPI (new product introduction) experience for at-scale deployment.
Experience troubleshooting problems at the system level, spanning multiple components as well as hardware/firmware/software boundaries.