Site Reliability Engineer
| Field | Details |
|---|---|
| Company | Fortinet |
| Location | Sunnyvale, CA, USA |
| Salary | $150,000 – $195,000 |
| Type | Full-Time |
| Degrees | |
| Experience Level | Mid Level |
Requirements
- 3 years of DevOps/SRE experience with production systems (depending on level)
- Strong development and automation skills.
- Extensive experience with Infrastructure as Code (Terraform, etc.) and its supporting tooling (Atlantis, ArgoCD, etc.)
- Extensive experience with Kubernetes and its supporting tooling (Helm, operators, etc.)
- Extensive experience with a variety of cloud managed services and providers
- Experience building production-quality cloud infrastructure that enables reliable and rapid deployment of microservices, with effective monitoring and built-in high availability and/or fault tolerance.
- Strong passion for using automation to create simple, repeatable dev and ops patterns that ensure a stable, reliable experience for customers.
- Strong cross-team communication skills.
- Experience with the building blocks of large-scale systems including load balancing, distributed/cloud computing, containers, instrumentation, and monitoring.
- Knowledge of cloud networking, including VPC configuration and cross-cloud connectivity.
- Familiarity with one or more programming languages (Python, Golang, etc.).
Responsibilities
- Automate as much as is reasonable to significantly improve the operational efficiency of the Lacework platform.
- Design, build, and improve our infrastructure to enhance service scalability, resiliency, and efficiency across the company.
- Identify mission-critical problems and solve them via automation, tooling, communication, and informed design.
- Build and improve monitoring and instrumentation to predict future scalability or failure risks and resolve them before they become customer-facing issues.
- Facilitate company-wide visibility into key metrics, SLAs, and milestones so that scale and resiliency are a part of every conversation.
- Develop best practices alongside engineering/operations teams to improve the scalability and reliability of internal processes.
- Participate in an on-call rotation.
Preferred Qualifications
- Experience with monitoring and observability systems and tools (Prometheus, Grafana, New Relic, Datadog, etc.)
- A belief that everything should be ‘as code’
- Experience with Java application servers and JVM configuration