Hardware Systems Engineer – AI Systems
Company | Meta |
---|---|
Location | Austin, TX, USA; Menlo Park, CA, USA |
Salary | $132,000 – $191,000 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior, Expert or higher |
Requirements
- Bachelor’s degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 6+ years of work experience in one or more domains such as: ASIC development (silicon design, bring-up, or characterization), compute (ARM, x86), AI/ML hardware/software (GPUs, TPUs), storage (SSD/HDD), memory (DRAM), network (NIC), or server interconnect technologies (PCIe, etc.)
- Knowledge of the architecture and components of at least one of the following products: server, PC, or laptop
- Development or debug experience in one or more of the following areas: hardware fault management, error reporting, or error handling on hardware products
Responsibilities
- Interface with external vendors and internal hardware, mechanical, power, thermal, manufacturing, and software engineers to understand system architecture, and use that understanding to guide and develop Hardware Fault Management for various server products
- Drive new platform enablement for the Meta fleet (hardware validation, tooling specification and integration, customer workload testing)
- Proactively create experiments and tooling to detect and diagnose hardware/firmware/software health issues
- Leverage a demonstrated understanding of RAS (reliability, availability, serviceability) to improve error reporting and error handling mechanisms for better operational quality and cost efficiency
- Drive engineering and operational rigor by establishing metrics and process for regular assessment and improvement
- Enhance understanding through data visualization and implement systemic solutions to hardware health issues
- Troubleshoot, diagnose, and root-cause system failures, isolating the failing components and failure scenarios while working with internal and external stakeholders
Preferred Qualifications
- 7+ years of experience with a subset of the following AI systems areas: accelerators (GPU/ASIC), performance characterization/optimization/tracing/debugging (e.g., on NVIDIA, AMD, Intel, or other accelerators), computer architecture, HPC communication libraries (e.g., NCCL, MPI), GPU interconnect technologies (NVLink, XGMI, etc.)
- Experience with architecture of disaggregated systems at scale
- New Product Introduction (NPI) experience for at-scale deployment
- Experience troubleshooting problems at the system level, spanning multiple components as well as hardware/firmware/software boundaries