HPC Engineer
Company | Chan Zuckerberg Biohub |
---|---|
Location | San Francisco, CA, USA |
Salary | $192,000 – $297,000 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior, Expert or higher |
Requirements
- Bachelor’s degree in Computer Science, Mathematics, Systems Engineering, or a related field; equivalent training or experience is also acceptable
- A minimum of 7 years of experience with progressively increasing responsibility in HPC environments or complex Linux environments
- Experience building on-prem HPC infrastructure and capacity planning
- Experience and expertise working on complex issues where analysis of situations or data requires an in-depth evaluation of variable factors
- Experience supporting scientific facilities, and prior knowledge of scientific user needs, program management, data management planning or lab-bench IT needs
- Experience with HPC and cloud computing environments
- Ability to interact with a variety of technical and scientific personnel with varied academic backgrounds
- Strong written and verbal communication skills to present and disseminate scientific software developments at group meetings
- Demonstrated ability to reason clearly about load, latency, bandwidth, performance, reliability, and cost and make sound engineering decisions balancing them
- Demonstrated ability to quickly and creatively implement novel solutions and ideas
- Proven ability to analyze, troubleshoot, and resolve complex problems that arise in production HPC storage hardware, software systems, and storage networks
- Configuring and administering parallel and network-attached storage (Lustre, NFS, ESS, Ceph) and storage subsystems (e.g. IBM, NetApp, DataDirect Networks, LSI)
- Installing, configuring, and maintaining job management tools (such as SLURM, Moab, TORQUE, PBS)
- Red Hat Enterprise Linux, CentOS, or derivatives and Linux services and technologies like dnsmasq, systemd, LDAP, PAM, sssd, OpenSSH, cgroups
- Scripting languages (including Bash, Python, or Perl)
- Virtualization (ESXi or KVM/libvirt), containerization (Docker or Singularity), configuration management and automation (tools like xCAT, Puppet, kickstart), and orchestration (Kubernetes, docker-compose, CloudFormation, Terraform)
- High-performance networking technologies (Ethernet and InfiniBand) and hardware (Mellanox and Juniper)
- Configuring, installing, tuning and maintaining scientific application software
- Familiarity with source control tools (Git or SVN)
Responsibilities
- Manage cluster-level services via the SLURM scheduler as well as user facing services such as Open OnDemand and NoMachine
- Install, configure and optimize applications and provide user support
- Work closely with many science teams simultaneously to translate experimental descriptions into software and hardware requirements across all phases of the scientific lifecycle, including data ingest, analysis, management and storage, computation, authentication, tool development, and other computational needs expressed by scientific projects
Preferred Qualifications
- Ability to understand researchers’ scientific challenges and translate them into computational solutions
- Scientific background, research experience, and/or experience in a university or other research setting