Senior Cloud Ops Engineer

Company: GoFundMe
Location: San Francisco, CA, USA
Salary: $156,000 – $234,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Bachelor’s Degree in Computer Science, a related field, or 8+ years of equivalent practical experience.
  • Minimum of 6 years of experience designing and managing scalable, cloud-based infrastructure, preferably in SaaS environments.
  • Deep technical expertise with a strong foundation in computer science, sharp engineering skills, and a commitment to delivering high-quality solutions.
  • Expert-level knowledge of AWS cloud services, container technologies like Docker and Kubernetes, and Infrastructure as Code (IaC) tools like Terraform and CloudFormation.
  • Proficiency in software architecture, including asynchronous event-driven architecture and microservices.
  • Experienced in performance and reliability testing using tools like Artillery, K6, or similar frameworks.
  • Experience in defining, monitoring, and managing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure the cloud infrastructure consistently meets performance and availability targets.
  • Proven expertise in disaster recovery planning and execution, including developing and implementing robust strategies to maintain business continuity and achieve rapid recovery in the event of an outage.
  • Hands-on experience with application performance management (APM) tools like New Relic, Datadog, and Splunk.
  • Advanced scripting and development skills in Bash, PHP, and Node.js.
  • Skilled in managing distributed data systems, troubleshooting complex issues under high load, and designing for high transaction volumes.
  • Knowledgeable in compliance regulations, including PCI, SOC2, and GDPR.

Responsibilities

  • Design and implement robust, fault-tolerant cloud solutions to process billions of dollars annually, ensuring scalability, resilience, and compliance.
  • Share expertise and foster a culture of continuous improvement, innovation, and learning within the team, contributing to technical mentorship and knowledge sharing.
  • Participate in strategic decisions regarding cloud architecture, influencing the adoption of best practices and cutting-edge technologies.
  • Work collaboratively to enhance system performance, observability, and reliability across the infrastructure, focusing on improving real-time monitoring and logging for operational excellence.
  • Lead initiatives to improve infrastructure resiliency, leveraging tools like AWS Resilience Hub and Fault Injection Simulator to test and enhance system robustness.
  • Drive application resilience by designing and executing load tests, simulating infrastructure faults, and analyzing results to improve fault tolerance.
  • Incorporate scalability and performance testing as integral parts of service design, ensuring services meet reliability and performance goals under high transaction volumes.
  • Embed testing phases within CI/CD pipelines to promote shift-left performance testing practices, improve efficiency, and reduce development cycle times.
  • Contribute to implementing and analyzing DORA (DevOps Research and Assessment) metrics to enhance the efficiency and effectiveness of the development lifecycle.
  • Participate in an on-call rotation to promptly address and resolve critical incidents, ensuring continuous operational excellence and rapid recovery during outages.

Preferred Qualifications

  • AWS cloud certifications.
  • Experience with fault-tolerant system design, large-scale distributed systems, and high-transaction environments.
  • Familiarity with tools and processes for infrastructure resiliency and fault injection testing.