Staff Site Reliability Engineer

Company: Illumio
Location: Sunnyvale, CA, USA
Salary: $192,000 – $230,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field; or equivalent work experience
  • 8+ years of relevant SRE experience
  • Strong hands-on experience with AWS and Azure
  • Familiarity with Kubernetes and containerized environments
  • Knowledge of networking concepts, such as DNS, load balancing, and firewalls
  • Proficient in diagnosing and resolving complex issues in SaaS environments, including performance bottlenecks and application errors
  • Proficiency in at least one programming language (e.g., Python, Go, Java) and scripting languages (e.g., Bash, PowerShell)
  • Experience with tools like Datadog, New Relic, Prometheus, Grafana, ELK, or Azure Monitor
  • Familiarity with tools like Ansible, Terraform, or CloudFormation
  • Knowledge of debugging and optimizing relational databases (e.g., PostgreSQL, MySQL) and caching systems (e.g., Redis, Memcached)
  • Experience with incident management tools and processes, including conducting RCAs and improving on-call processes

Responsibilities

  • Investigate and resolve production incidents and escalations to ensure minimal downtime and impact to customers
  • Work closely with engineering and support teams to troubleshoot application and infrastructure issues
  • Proactively monitor application health, performance, and reliability using modern observability tools
  • Analyze trends in system behavior and suggest performance improvements
  • Develop and maintain automation scripts and tools to improve operational efficiency and incident resolution
  • Create and enhance runbooks to streamline troubleshooting and reduce mean time to resolution (MTTR)
  • Conduct thorough post-incident reviews to identify root causes and implement preventive measures
  • Drive a culture of continuous improvement by documenting lessons learned and improving system designs
  • Partner with software engineers, QA, and product teams to improve application stability and user experience
  • Act as a bridge between development and operations, ensuring smooth and reliable service delivery
