Senior AI/HPC Storage Engineer

Company	Recursion Pharmaceuticals
Location	Salt Lake City, UT, USA; Toronto, ON, Canada
Salary	$160,000 – $182,000
Type	Full-Time
Degrees
Experience Level	Senior

Requirements

Deep expertise in parallel file systems, specifically IBM Storage Scale (GPFS), including policy management for data lifecycle, disaster recovery, snapshots, and tiered storage strategies.
Experience debugging and resolving GPFS production issues, such as hanging directories, degraded states, and performance bottlenecks, to ensure system stability and optimal performance.
Ability to hit the ground running, working autonomously to identify, propose, and implement improvements in storage solutions.
GPFS AFM (Active File Management) experience is a big plus.
Strong understanding of storage access methods and the differences between parallel file systems, NAS (e.g., NFS), and object storage.
Experience leading and optimizing on-premises storage solutions (primarily GPFS) and integrating with hybrid object storage (MinIO).
Ability to define and implement data lifecycle management policies to optimize storage efficiency and cost-effectiveness.
Familiarity with RDMA-capable high-speed networking for storage performance optimization.
Strong troubleshooting skills in complex Linux-based computing and storage environments.
Experience working with storage vendor support for debugging, troubleshooting, and adding new features (e.g., AFM, GPFS policies, support tickets, white papers, etc.).
Ability to manage hardware support in the datacenter, including coordinating RMA processes for failed components; upgrading software and firmware on GPFS hardware; using screen/tmux/console sessions for remote support and troubleshooting; being on site at the datacenter(s) when remote support doesn’t suffice; and coordinating the procurement and installation of new hardware.
Basic Git experience is required; knowledge of CI/CD, GitOps, and Infrastructure as Code (IaC) is nice to have but can be learned on the job.
Some exposure to software-defined infrastructure and cloud storage solutions (GCP, On-Prem) is a plus.
Python and Bash scripting experience for automation and operational efficiency.
Bonus: Experience with Slurm and Kubernetes for job scheduling and containerized workloads (e.g., Apptainer, Docker).
Strong verbal and written communication skills for documentation and collaboration.
Ability to mentor and guide team members, helping to establish best practices in storage management.

Responsibilities

Designing, implementing, testing, maintaining, and optimizing our data storage infrastructure and services, utilizing an Infrastructure as Code approach across both on-premises and public cloud environments.
Driving innovation across all storage tiers within our AI/HPC infrastructure, ensuring we deliver a scalable and effective data platform to support our mission.
Developing scripts and workflows to automate and verify storage infrastructure provisioning and dynamic reconfiguration, enhancing support for our AI/HPC storage environments.
Conducting performance analysis, benchmarking, troubleshooting, and fine-tuning of our data storage systems and services, while efficiently managing user tickets.
Researching, deploying, and optimizing accessibility, performance, security, and data lifecycle management policies.
Conducting regular assessments of our storage platforms’ health and operational performance against established metrics, with a focus on meeting and exceeding operational service level objectives.
Leading in technical communication and customer collaboration to ensure high levels of customer satisfaction.

Preferred Qualifications

GPFS AFM (Active File Management) experience is a big plus.
Knowledge of CI/CD, GitOps, and Infrastructure as Code (IaC) is nice to have but can be learned on the job.
Some exposure to software-defined infrastructure and cloud storage solutions (GCP, On-Prem) is a plus.
Bonus: Experience with Slurm and Kubernetes for job scheduling and containerized workloads (e.g., Apptainer, Docker).