Software Engineer – Accelerator Solutions & Technologies
Company | Meta |
---|---|
Location | Menlo Park, CA, USA; New York, NY, USA |
Salary | $56.25/hour – $173,000/year |
Type | Full-Time |
Degrees | Bachelor’s, Master’s, PhD |
Experience Level | Junior, Mid Level |
Requirements
- Currently has, or is in the process of obtaining, a Bachelor’s degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta.
- Master’s or PhD in Computer Science, Computer Engineering, or another relevant technical field
- 2+ years of experience developing in a C++ codebase
- 2+ years of experience developing in a Python codebase
- Understanding of performance benchmarking, measurement, and optimization for collective communications and distributed, at-scale model training
Responsibilities
- Contribute to our developer infrastructure, including simulation and HW emulation platforms, to enable performance measurement and optimization for Meta’s in-house accelerator programs.
- Understand and contribute to the collective communications library intended for deployment on Meta’s AI/ML superclusters.
- Support networking and compute hardware acceleration techniques that improve ML training and inference performance.
- Perform architectural analysis to ensure system designs meet performance, scalability, and reliability requirements.
- Implement simulation models for Meta’s accelerator ASICs, and develop and analyze scenarios to evaluate data center performance and identify potential improvements.
- Collaborate with architects and engineers to integrate simulation results into system design processes.
- Use instruction set simulators to define performant firmware for Meta’s training/inference accelerators.
- Collaborate with hardware and firmware teams to ensure accurate modeling and simulation of accelerator functionalities.
- Analyze simulation results to guide firmware development and optimization efforts.
Preferred Qualifications
- Full-stack experience and understanding of AI/HPC systems, with a focus on the application layer and performance optimizations
- Familiarity with relevant tools, libraries, and frameworks (e.g., PyTorch, CUDA)
- Knowledge of AI/HPC hardware requirements and specifications (e.g., configuring hardware components for AI/HPC workloads)
- Understanding of the transport stack (e.g., RoCE) and its constraints, particularly as they pertain to interconnects and collective communications
- Experience with SystemC