Hardware Systems Engineer – AI Systems
| Field | Value |
|---|---|
| Company | Meta |
| Location | Menlo Park, CA, USA |
| Salary | $132,000 – $191,000 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Senior, Expert or higher |
Requirements
- 5+ years of work experience in one or more domains such as ASIC development (silicon design, bring-up, or characterization), compute (ARM, x86), AI/ML hardware and software (GPUs, TPUs), storage (SSD/HDD), memory (DRAM), networking (NIC), and server interconnect technologies (e.g., PCIe).
- Knowledge of the architecture and components of at least one of the following products: server, PC, or laptop.
- Development or debug experience in one or more of the following areas: hardware fault management, error reporting, or error handling on hardware products.
- Bachelor’s degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience.
Responsibilities
- Interface with external vendors and with internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, and use that understanding to guide and develop hardware fault management for various server products.
- Drive new platform enablement for the Meta fleet (hardware validation, tooling specification and integration, customer workload testing).
- Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues.
- Leverage a demonstrated understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operational quality and cost efficiency.
- Drive engineering and operational rigor by establishing metrics and processes for regular assessment and improvement.
- Develop visibility through data visualization and implement systemic solutions to hardware health issues.
- Troubleshoot, diagnose, and root-cause system failures, isolating the affected components and failure scenarios while working with internal and external stakeholders.
Preferred Qualifications
- 7+ years of experience with a subset of the following AI systems areas: accelerators (GPU/ASIC); performance characterization, optimization, tracing, and debugging (e.g., NVIDIA, AMD, Intel, or other accelerators); computer architecture; HPC communication libraries (e.g., NCCL, MPI); GPU interconnect technologies (e.g., NVLink, XGMI).
- Experience with architecture of disaggregated systems at scale.
- NPI (new product introduction) experience for at-scale deployment.
- Experience troubleshooting system-level problems that span multiple components and cross hardware/firmware/software boundaries.