HPC Engineer

Company	Chan Zuckerberg Biohub
Location	San Francisco, CA, USA
Salary	$192,000 – $297,000
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Senior, Expert or higher

Requirements

Bachelor’s degree in Computer Science, Mathematics, Systems Engineering, or a related field; equivalent training or experience is also acceptable
A minimum of 7 years of experience with progressively increasing responsibility in HPC environments or complex Linux environments
Experience building on-prem HPC infrastructure and capacity planning
Experience and expertise working on complex issues where analysis of situations or data requires an in-depth evaluation of variable factors
Experience supporting scientific facilities, and prior knowledge of scientific user needs, program management, data management planning or lab-bench IT needs
Experience with HPC and cloud computing environments
Ability to interact with a variety of technical and scientific personnel with varied academic backgrounds
Strong written and verbal communication skills to present and disseminate scientific software developments at group meetings
Demonstrated ability to reason clearly about load, latency, bandwidth, performance, reliability, and cost and make sound engineering decisions balancing them
Demonstrated ability to quickly and creatively implement novel solutions and ideas
Proven ability to analyze, troubleshoot, and resolve complex problems that arise in HPC production storage hardware, software systems, and storage networks
Configuring and administering parallel and network-attached storage (Lustre, NFS, ESS, Ceph) and storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI)
Installing, configuring, and maintaining job management tools (such as SLURM, Moab, TORQUE, PBS)
Red Hat Enterprise Linux, CentOS, or derivatives and Linux services and technologies like dnsmasq, systemd, LDAP, PAM, sssd, OpenSSH, cgroups
Scripting languages (including Bash, Python, or Perl)
Virtualization (ESXi or KVM/libvirt), containerization (Docker or Singularity), configuration management and automation (tools like xCAT, Puppet, kickstart), and orchestration (Kubernetes, docker-compose, CloudFormation, Terraform)
High performance networking technologies (Ethernet and InfiniBand) and hardware (Mellanox and Juniper)
Configuring, installing, tuning and maintaining scientific application software
Familiarity with source control tools (Git or SVN)

Responsibilities

Manage cluster-level services via the SLURM scheduler as well as user facing services such as Open OnDemand and NoMachine
Install, configure and optimize applications and provide user support
Work closely with many science teams simultaneously to translate experimental descriptions into software and hardware requirements across all phases of the scientific lifecycle, including data ingest, analysis, management and storage, computation, authentication, tool development, and other computational needs expressed by scientific projects

Preferred Qualifications

Ability to understand researchers’ scientific challenges and translate them into computational solutions
Scientific background, research experience, and/or experience in a university or research setting