Senior AI/HPC Storage Engineer
Company | Recursion Pharmaceuticals |
---|---|
Location | Salt Lake City, UT, USA; Toronto, ON, Canada |
Salary | $160,000 – $182,000 |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- Deep expertise in parallel file systems, specifically IBM Storage Scale (GPFS), including policy management for data lifecycle, disaster recovery, snapshots, and tiered storage strategies.
- Experience debugging and resolving GPFS production issues, such as hanging directories, degraded states, and performance bottlenecks, to ensure system stability and optimal performance.
- Ability to hit the ground running, working autonomously to identify, propose, and implement improvements in storage solutions.
- GPFS AFM (Active File Management) experience is a big plus.
- Strong understanding of storage access methods and the differences between parallel file systems, NAS (e.g., NFS), and object storage.
- Experience leading and optimizing on-premises storage solutions (primarily GPFS) and integrating with hybrid object storage (MinIO).
- Ability to define and implement data lifecycle management policies to optimize storage efficiency and cost-effectiveness.
- Familiarity with RDMA-capable high-speed networking for storage performance optimization.
- Strong troubleshooting skills in complex Linux-based computing and storage environments.
- Experience working with storage vendor support for debugging, troubleshooting, and adding new features (e.g., AFM, GPFS policies, support tickets, white papers, etc.).
- Ability to manage hardware support in the datacenter, including:
  - Coordinating RMA processes for failed components.
  - Upgrading software and firmware on GPFS hardware.
  - Using screen/tmux/console sessions for remote support and troubleshooting.
  - Showing up in the datacenter(s) as needed when remote support doesn’t suffice.
  - Coordinating the procurement and installation of new hardware.
- Basic Git experience is required; knowledge of CI/CD, GitOps, and Infrastructure as Code (IaC) is nice to have but can be learned on the job.
- Some exposure to software-defined infrastructure and cloud storage solutions (GCP, On-Prem) is a plus.
- Python and Bash scripting experience for automation and operational efficiency.
- Bonus: Experience with Slurm and Kubernetes for job scheduling and containerized workloads (e.g., Apptainer, Docker).
- Strong verbal and written communication skills for documentation and collaboration.
- Ability to mentor and guide team members, helping to establish best practices in storage management.
Responsibilities
- Designing, implementing, testing, maintaining, and optimizing our data storage infrastructure and services, utilizing an Infrastructure as Code approach across both on-premises and public cloud environments.
- Driving innovation across all storage tiers within our AI/HPC infrastructure, ensuring we deliver a scalable and effective data platform to support our mission.
- Developing scripts and workflows to automate and verify storage infrastructure provisioning and dynamic reconfiguration, enhancing support for our AI/HPC storage environments.
- Conducting performance analysis, benchmarking, troubleshooting, and fine-tuning of our data storage systems and services, while efficiently managing user tickets.
- Researching, deploying, and optimizing accessibility, performance, security, and data lifecycle management policies.
- Conducting regular assessments of our storage platforms’ health and operational performance against established metrics, with a focus on meeting and exceeding operational service level objectives.
- Leading in technical communication and customer collaboration to ensure high levels of customer satisfaction.
Preferred Qualifications
- GPFS AFM (Active File Management) experience is a big plus.
- Knowledge of CI/CD, GitOps, and Infrastructure as Code (IaC) is nice to have but can be learned on the job.
- Some exposure to software-defined infrastructure and cloud storage solutions (GCP, On-Prem) is a plus.
- Bonus: Experience with Slurm and Kubernetes for job scheduling and containerized workloads (e.g., Apptainer, Docker).