Staff Software Engineer – Reliability Engineer – Observability

Company: The Home Depot
Location: Georgia, USA
Salary: $120,000 – $190,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Must be eighteen years of age or older.
  • Must be legally permitted to work in the United States.
  • The knowledge, skills and abilities typically acquired through the completion of a bachelor’s degree program or equivalent degree in a field of study related to the job.
  • 3 years of work experience.

Responsibilities

  • Develops, tests, deploys, and maintains software, with a clear understanding of the value the software is to provide.
  • Takes a broad view when approaching issues, using a global lens.
  • Consistently achieves results, even under tough circumstances.
  • Develops test suites (functional, destructive, etc.) to enable successful, rapid deployment of code to production.
  • Takes on new opportunities and tough challenges with a sense of urgency, high energy and enthusiasm.
  • Actively seeks ways to grow and be challenged using both formal and informal development channels.
  • Learns through successful and failed experiments when tackling new problems.
  • Creates new and better ways for the organization to be successful.
  • Delivers multi-mode communications that convey a clear understanding of the unique needs of different audiences.
  • Works with the Product Team to ensure user stories are developer-ready, easy to understand, and testable.
  • Collaborates with other team members in agile processes.
  • Relates openly and comfortably with diverse groups of people.
  • Adapts approach and demeanor in real time to match the shifting demands of different situations.
  • Fields questions from product and engineering teams.
  • Helps grow junior engineers by providing guidance on modern software development frameworks and leading technical discussions.
  • Notes gaps on the team and provides suggestions for changes to make the team more productive.

Preferred Qualifications

  • 3-5 years of relevant work experience in site reliability engineering or a related field
  • Experience in monitoring and observability, including designing and implementing observability solutions using OpenTelemetry, Prometheus, and distributed tracing
  • Proficiency in cloud platforms (GCP preferred) and infrastructure as code (Terraform, Ansible)
  • Experience in programming languages such as Go, Python, and Java
  • Experience with creating and executing unit, functional, destructive, and performance tests
  • Experience with modern debugging and root cause analysis techniques
  • Experience in designing systems for High Availability, Disaster Recovery, Performance, Efficiency, and Security
  • Experience in leading observability initiatives, including defining instrumentation standards and building monitoring dashboards
  • Hands-on experience implementing alerting thresholds and automated responses based on service level objectives (SLOs)
  • Strong experience with Kubernetes cluster management, optimization, and scaling
  • Expertise in container orchestration, including best practices for containerized application deployments and resource optimization
  • Experience designing, building, and maintaining scalable cloud infrastructure on GCP
  • Proficiency in automating routine operational tasks to reduce toil and improve efficiency
  • Familiarity with integrating observability-driven alerts with incident management systems and leading incident response efforts
  • Experience optimizing system performance, identifying and resolving bottlenecks, and conducting capacity planning
  • Knowledge of database performance tuning, query optimization, and designing application stress testing methodologies
  • Familiarity with service mesh technologies (Istio, Linkerd)