Staff Software Engineer - Reliability Engineer - Observability

Company	The Home Depot
Location	Georgia, USA
Salary	$120,000 – $190,000
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Senior

Requirements

Must be eighteen years of age or older.
Must be legally permitted to work in the United States.
The knowledge, skills and abilities typically acquired through the completion of a bachelor’s degree program or equivalent degree in a field of study related to the job.
3 years of work experience.

Responsibilities

Develops, tests, deploys, and maintains software, with a clear understanding of the value the software is to provide.
Takes a broad view when approaching issues, using a global lens.
Consistently achieves results, even under tough circumstances.
Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production.
Takes on new opportunities and tough challenges with a sense of urgency, high energy and enthusiasm.
Actively seeks ways to grow and be challenged using both formal and informal development channels.
Learns through successful and failed experiments when tackling new problems.
Creates new and better ways for the organization to be successful.
Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences.
Works with the Product Team to ensure user stories are developer-ready, easy to understand, and testable.
Collaborates with other team members in agile processes.
Relates openly and comfortably with diverse groups of people.
Adapts approach and demeanor in real time to match the shifting demands of different situations.
Fields questions from product and engineering teams.
Helps grow junior engineers by providing guidance on modern software development frameworks and leading technical discussions.
Notes gaps on the team and provides suggestions for changes to make the team more productive.

Preferred Qualifications

3-5 years of relevant work experience in site reliability engineering or related field
Experience in monitoring and observability, including designing and implementing observability solutions using OpenTelemetry, Prometheus, and distributed tracing
Proficiency in cloud platforms (GCP preferred) and infrastructure as code (Terraform, Ansible)
Experience in programming languages such as Go, Python, and Java
Experience with creating and executing unit, functional, destructive, and performance tests
Experience with modern debugging and root cause analysis techniques
Experience in designing systems for High Availability, Disaster Recovery, Performance, Efficiency, and Security
Experience in leading observability initiatives, including defining instrumentation standards and building monitoring dashboards
Hands-on experience implementing alerting thresholds and automated responses based on service level objectives (SLOs)
Strong experience with Kubernetes cluster management, optimization, and scaling
Expertise in container orchestration, including best practices for containerized application deployments and resource optimization
Experience designing, building, and maintaining scalable cloud infrastructure on GCP
Proficiency in automating routine operational tasks to reduce toil and improve efficiency
Familiarity with integrating observability-driven alerts with incident management systems and leading incident response efforts
Experience optimizing system performance, identifying and resolving bottlenecks, and conducting capacity planning
Knowledge of database performance tuning, query optimization, and designing application stress testing methodologies
Familiarity with service mesh technologies (Istio, Linkerd)