Staff Site Reliability Engineer
Company | Visa |
---|---|
Location | Ashburn, VA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s, Master’s, MBA, PharmD |
Experience Level | Senior, Expert or higher |
Requirements
- 5+ years of relevant work experience with a Bachelor’s Degree or at least 2 years of work experience with an Advanced Degree (e.g. Master’s, MBA, JD, MD) or 0 years of work experience with a PhD, OR 8+ years of relevant work experience.
- Hands-on experience with Linux and Windows systems and a good understanding of distributed computing environments.
- Intermediate-level programming and/or scripting in 3 or more of the following: Python, Java, Go, PowerShell, JavaScript, Terraform, Ansible, Helm, Chef, CloudFormation.
- 2+ years of experience managing CI/CD tooling such as Jenkins, GitHub, Bitbucket, ArgoCD, Artifactory, or Azure DevOps in a large-scale environment.
- 3+ years of experience managing observability tooling such as Grafana, Prometheus, Splunk, Datadog, New Relic, Dynatrace, Sentry, etc. in a large-scale environment.
- Advanced understanding of YAML, JSON, HTML, XML.
- 2+ years of work experience supporting relational and non-relational databases (MySQL, MongoDB, PostgreSQL, etc.), including creating and running queries and managing performance and scaling.
- Experience managing container infrastructure and supporting development teams' transformation to a container-first model.
- 3 or more years working in a Platform, SRE or Production Engineering group for high availability/critical platforms/applications.
- Exposure to virtualization (Hyper-V, VMware, SCVMM, etc.).
- Experience managing a distributed container platform, including but not limited to deployment/release management, provisioning, capacity management, and workload management.
Responsibilities
- Guide the instrumentation of monitoring for the Visa Cloud Platform (IaaS/PaaS/Container as a service).
- Ensure the platform target SLAs are met and implement appropriate SLIs for supporting services.
- Work with developers during service transition, evaluating reliability and operability of the applications and ensuring adequate monitoring, alerting and observability.
- Partner with peers within Operations & Infrastructure supporting ongoing maintenance and enhancement of the platform.
- Focus on setting standards for automating routine tasks and workflows in support of the larger DevEx SRE team.
- Support multiple internal stakeholders with a variety of technical challenges, analyze and discern patterns in issues, and propose solutions to these problems.
- Work in a 24/7/365 operating model, including shift or on-call support (weekend coverage required).
Preferred Qualifications
- 6 or more years of work experience with a Bachelor’s Degree or 4 or more years of relevant experience with an Advanced Degree (e.g. Master’s, MBA, JD, MD) or up to 3 years of relevant experience with a PhD.
- Master’s Degree in IT, CS or related field and/or 5+ years relevant work experience.