Bare Metal Technical Program Manager

Company: CoreWeave
Location: Bellevue, WA, USA; Sunnyvale, CA, USA
Salary: $122,000 – $187,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or equivalent experience
  • 5+ years of experience in hands-on management and support of complex bare metal infrastructure environments and data center operations
  • Comprehensive understanding of modern server hardware architectures, including specialized compute accelerators (GPUs) and high-speed interconnect technologies from leading high-performance computing vendors such as NVIDIA, Dell, or HPE
  • Demonstrated expertise in Linux system administration, encompassing deep familiarity with command-line operations and system configuration
  • Proficiency in at least one high-level scripting language (e.g., Python) and practical experience with infrastructure and/or network automation tools, methodologies, and frameworks (e.g., Ansible)
  • Extensive experience with modern infrastructure monitoring and logging tools such as Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana)
  • Working knowledge of enterprise ticketing systems (e.g., Jira) and an understanding of IT Service Management (ITSM) frameworks and best practices
  • Strong analytical and problem-solving skills, with the ability to systematically diagnose and resolve complex technical issues
  • Excellent communication and collaboration abilities, with experience working effectively across multidisciplinary technical teams
  • Self-motivated and proactive, with a demonstrated sense of ownership and a commitment to ensuring infrastructure reliability and performance
  • Proven ability to manage multiple tasks and priorities effectively in a fast-paced and dynamic environment

Responsibilities

  • Provide expert-level technical support and in-depth troubleshooting for a wide spectrum of hardware and associated software issues, encompassing server malfunctions, network outages, and performance degradations
  • Manage the lifecycle of our bare metal infrastructure, including overseeing deployment methodologies, executing maintenance procedures, coordinating upgrades, and managing hardware retirement processes
  • Architect and implement automation solutions through scripting and tooling to streamline repetitive operational tasks, enhance overall efficiency, and minimize manual intervention across the infrastructure
  • Lead the development and refinement of critical operational processes, comprehensive technical documentation (SOPs, TSGs, runbooks), and the establishment of engineering best practices to bolster team effectiveness and infrastructure resilience
  • Engage in close collaboration with Software, Network, and Data Center Operations Engineering teams to facilitate effective issue resolution, contribute to strategic project planning, and ensure the cohesive operation of the entire infrastructure ecosystem
  • Serve as a key technical point of contact for hardware and software vendors, managing technical support engagements, overseeing the RMA process, and driving the resolution of complex hardware-centric challenges
  • Design, deploy, and maintain sophisticated monitoring and alerting frameworks to proactively identify and mitigate potential infrastructure anomalies and performance deviations
  • Participate actively in incident response protocols, conduct thorough root cause analyses (RCAs) for infrastructure events, and contribute to problem management strategies aimed at preventing future occurrences
  • Contribute technical expertise to, and potentially lead, infrastructure-focused projects, including new hardware deployments, critical system upgrades, and the integration of new operational tooling
  • Mentor and guide junior engineering team members, fostering technical growth and contributing to the development of internal knowledge resources and training programs
  • Maintain the integrity of hardware asset tracking and related data within our infrastructure inventory systems (e.g., Snipe-IT)
  • Adhere to and promote stringent security protocols and best practices related to infrastructure access and maintenance activities

Preferred Qualifications

  • Ability to work effectively across multidisciplinary technical teams
  • Experience in a fast-paced and dynamic environment