Staff Software Engineer – Reliability Engineer – Observability
| Company | The Home Depot |
|---|---|
| Location | Georgia, USA |
| Salary | $120,000 – $190,000 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Senior |
Requirements
- Must be eighteen years of age or older.
- Must be legally permitted to work in the United States.
- The knowledge, skills and abilities typically acquired through the completion of a bachelor’s degree program or equivalent degree in a field of study related to the job.
- 3 years of work experience.
Responsibilities
- Develops, tests, deploys, and maintains software, with a clear understanding of the value the software is to provide.
- Takes a broad view when approaching issues, using a global lens.
- Consistently achieves results, even under tough circumstances.
- Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production.
- Takes on new opportunities and tough challenges with a sense of urgency, high energy and enthusiasm.
- Actively seeks ways to grow and be challenged using both formal and informal development channels.
- Learns through successful and failed experiments when tackling new problems.
- Creates new and better ways for the organization to be successful.
- Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences.
- Works with the Product Team to ensure user stories are developer-ready, easy to understand, and testable.
- Collaborates with other team members in agile processes.
- Relates openly and comfortably with diverse groups of people.
- Adapts approach and demeanor in real time to match the shifting demands of different situations.
- Fields questions from product and engineering teams.
- Helps grow junior engineers by providing guidance on modern software development frameworks and leading technical discussions.
- Notes gaps on the team and provides suggestions for changes to make the team more productive.
Preferred Qualifications
- 3-5 years of relevant work experience in site reliability engineering or related field
- Experience in monitoring and observability, including designing and implementing observability solutions using OpenTelemetry, Prometheus, and distributed tracing
- Proficiency in cloud platforms (GCP preferred) and infrastructure as code (Terraform, Ansible)
- Experience in programming languages such as Go, Python, and Java
- Experience with creating and executing unit, functional, destructive, and performance tests
- Experience with modern debugging and root cause analysis techniques
- Experience in designing systems for High Availability, Disaster Recovery, Performance, Efficiency, and Security
- Experience in leading observability initiatives, including defining instrumentation standards and building monitoring dashboards
- Hands-on experience implementing alerting thresholds and automated responses based on service level objectives (SLOs)
- Strong experience with Kubernetes cluster management, optimization, and scaling
- Expertise in container orchestration, including best practices for containerized application deployments and resource optimization
- Experience designing, building, and maintaining scalable cloud infrastructure on GCP
- Proficiency in automating routine operational tasks to reduce toil and improve efficiency
- Familiarity with integrating observability-driven alerts with incident management systems and leading incident response efforts
- Experience optimizing system performance, identifying and resolving bottlenecks, and conducting capacity planning
- Knowledge of database performance tuning, query optimization, and designing application stress testing methodologies
- Familiarity with service mesh technologies (Istio, Linkerd)