Senior Cloud Ops Engineer

Company	GoFundMe
Location	San Francisco, CA, USA
Salary	$156,000 – $234,000
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Senior

Requirements

Bachelor’s degree in Computer Science or a related field, or 8+ years of equivalent practical experience.
Minimum of 6 years of experience designing and managing scalable, cloud-based infrastructure, preferably in SaaS environments.
Deep technical expertise with a strong foundation in computer science, sharp engineering skills, and a commitment to delivering high-quality solutions.
Expert-level knowledge of AWS cloud services, container technologies like Docker and Kubernetes, and Infrastructure as Code (IaC) tools like Terraform and CloudFormation.
Proficiency in software architecture, including asynchronous event-driven architecture and microservices.
Experience with performance and reliability testing using tools like Artillery, k6, or similar frameworks.
Experience in defining, monitoring, and managing Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure the cloud infrastructure consistently meets performance and availability targets.
Proven expertise in disaster recovery planning and execution, including developing and implementing robust strategies to maintain business continuity and achieve rapid recovery in the event of an outage.
Hands-on experience with application performance management (APM) tools like New Relic, Datadog, and Splunk.
Advanced scripting and development skills in Bash, PHP, and Node.js.
Skilled in managing distributed data systems, troubleshooting complex issues under high load, and designing for high transaction volumes.
Knowledge of compliance standards and regulations, including PCI, SOC 2, and GDPR.

Responsibilities

Design and implement robust, fault-tolerant cloud solutions to process billions of dollars annually, ensuring scalability, resilience, and compliance.
Share expertise through technical mentorship and foster a culture of continuous improvement, innovation, and learning within the team.
Participate in strategic decisions regarding cloud architecture, influencing the adoption of best practices and cutting-edge technologies.
Work collaboratively to enhance system performance, observability, and reliability across the infrastructure, focusing on improving real-time monitoring and logging for operational excellence.
Lead initiatives to improve infrastructure resiliency, leveraging tools like AWS Resilience Hub and AWS Fault Injection Simulator to test and enhance system robustness.
Drive application resilience by designing and executing load tests, simulating infrastructure faults, and analyzing results to improve fault tolerance.
Incorporate scalability and performance testing as integral parts of service design, ensuring services meet reliability and performance goals under high transaction volumes.
Embed testing phases within CI/CD pipelines to promote shift-left performance testing practices, improve efficiency, and reduce development cycle times.
Contribute to implementing and analyzing DORA (DevOps Research and Assessment) metrics to enhance the efficiency and effectiveness of the development lifecycle.
Participate in an on-call rotation to promptly address and resolve critical incidents, ensuring continuous operational excellence and rapid recovery during outages.

Preferred Qualifications

AWS cloud certifications.
Experience with fault-tolerant system design, large-scale distributed systems, and high-transaction environments.
Familiarity with tools and processes for infrastructure resiliency and fault injection testing.