Senior AI/HPC Storage Engineer

Company: Recursion Pharmaceuticals
Location: Salt Lake City, UT, USA; Toronto, ON, Canada
Salary: $160,000 – $182,000
Type: Full-Time
Experience Level: Senior

Requirements

  • Deep expertise in parallel file systems, specifically IBM Storage Scale (GPFS), including policy management for data lifecycle, disaster recovery, snapshots, and tiered storage strategies.
  • Experience debugging and resolving GPFS production issues, such as hanging directories, degraded states, and performance bottlenecks, to ensure system stability and optimal performance.
  • Ability to hit the ground running, working autonomously to identify, propose, and implement improvements in storage solutions.
  • GPFS AFM (Active File Management) experience is a big plus.
  • Strong understanding of storage access methods and the differences between parallel file systems, NAS (e.g., NFS), and object storage.
  • Experience leading and optimizing on-premises storage solutions (primarily GPFS) and integrating with hybrid object storage (MinIO).
  • Ability to define and implement data lifecycle management policies to optimize storage efficiency and cost-effectiveness.
  • Familiarity with RDMA-capable high-speed networking for storage performance optimization.
  • Strong troubleshooting skills in complex Linux-based computing and storage environments.
  • Experience working with storage vendor support for debugging, troubleshooting, and adding new features (e.g., AFM, GPFS policies, support tickets, white papers, etc.).
  • Ability to manage hardware support in the datacenter, including:
      • Coordinating RMA processes for failed components.
      • Upgrading software and firmware on GPFS hardware.
      • Using screen/tmux/console sessions for remote support and troubleshooting.
      • Working on site in the datacenter(s) when remote support doesn’t suffice.
      • Coordinating the procurement and installation of new hardware.
  • Basic Git experience is required; knowledge of CI/CD, GitOps, and Infrastructure as Code (IaC) is nice to have but can be learned on the job.
  • Some exposure to software-defined infrastructure and cloud storage solutions (GCP, On-Prem) is a plus.
  • Python and Bash scripting experience for automation and operational efficiency.
  • Bonus: Experience with Slurm and Kubernetes for job scheduling and containerized workloads (e.g., Apptainer, Docker).
  • Strong verbal and written communication skills for documentation and collaboration.
  • Ability to mentor and guide team members, helping to establish best practices in storage management.

Responsibilities

  • Designing, implementing, testing, maintaining, and optimizing our data storage infrastructure and services, utilizing an Infrastructure as Code approach across both on-premises and public cloud environments.
  • Driving innovation across all storage tiers within our AI/HPC infrastructure, ensuring we deliver a scalable and effective data platform to support our mission.
  • Developing scripts and workflows to automate and verify storage infrastructure provisioning and dynamic reconfiguration, enhancing support for our AI/HPC storage environments.
  • Conducting performance analysis, benchmarking, troubleshooting and fine-tuning of our data storage systems and services, while efficiently managing user tickets.
  • Researching, deploying, and optimizing accessibility, performance, security, and data lifecycle management policies.
  • Regularly assessing our storage platforms’ health and operational performance against established metrics, with a focus on meeting and exceeding operational service level objectives.
  • Leading technical communication and customer collaboration to ensure high levels of customer satisfaction.

Preferred Qualifications

  • GPFS AFM (Active File Management) experience is a big plus.
  • Knowledge of CI/CD, GitOps, and Infrastructure as Code (IaC) is nice to have but can be learned on the job.
  • Some exposure to software-defined infrastructure and cloud storage solutions (GCP, On-Prem) is a plus.
  • Bonus: Experience with Slurm and Kubernetes for job scheduling and containerized workloads (e.g., Apptainer, Docker).