Production Engineer – Storage
Company | Crusoe |
---|---|
Location | San Francisco, CA, USA; Sunnyvale, CA, USA |
Salary | $183,000 – $210,000 |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- 5+ years of professional experience in SRE, systems, or storage engineering.
- Hands-on experience with distributed storage systems (e.g., Ceph, GlusterFS, OpenEBS) and deep understanding of object, block, and file storage paradigms.
- Proficiency in a programming language such as Python, Go, Java, or C.
- Experience with Infrastructure as Code and deployment tooling such as Terraform, Ansible, or Puppet.
- Deep knowledge of Linux internals with a focus on I/O subsystems, memory management, and storage scheduling.
- Familiarity with storage protocols like NFS, SMB, iSCSI, or NVMe-oF.
- Strong experience working with containerized workloads and orchestration platforms (e.g., Kubernetes, Docker).
- Excellent incident response, troubleshooting, and documentation practices.
- Experience building and operating managed storage services at scale (object, file, and block), such as those offered by AWS, GCP, or Azure.
- Excellent communication skills.
- Must be able to pass a background check.
Responsibilities
- Build automation and self-healing tools to monitor and maintain Crusoe’s distributed cloud storage infrastructure, which includes block, file, and object storage systems.
- Drive reliability initiatives focused on data replication, encryption, backup and restore strategies, and robust failover mechanisms.
- Collaborate closely with storage engineers to help implement and maintain high-performance NVMe- and SSD-backed volumes that support large-scale AI compute clusters.
- Support user-facing storage services with a focus on availability, performance tuning, and adherence to error budgets.
- Investigate and resolve storage-related incidents using deep telemetry, logs, and performance profiling.
- Partner with hardware and kernel teams to diagnose low-level I/O issues and optimize I/O paths, cache policies, and file systems.
- Contribute to the architecture of fault-tolerant, scalable storage backends tailored for AI-first cloud environments.
Preferred Qualifications
- Contributions to open-source storage projects or the Linux storage stack.
- Experience with hybrid storage models across on-prem and cloud environments.
- Familiarity with high-throughput network topologies for storage backplanes (e.g., RoCE, RDMA, InfiniBand).