Site Reliability Engineer – SRE

Company: Huntington Bancshares
Location: Columbus, OH, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Mid Level, Senior

Requirements

  • Bachelor’s degree in Computer Science or Information Technology.
  • 3+ years of experience in site reliability engineering, DevOps, systems administration, or related roles.

Responsibilities

  • Lead troubleshooting efforts for high-impact production issues, providing detailed root cause analysis (RCA) and preventative measures.
  • Participate in on-call rotations, acting as an escalation point for Level 1 SREs during major incidents.
  • Develop and maintain automation scripts and infrastructure using tools like Terraform, Ansible, or CloudFormation.
  • Implement automation solutions to eliminate manual tasks and improve system reliability, scalability, and performance.
  • Analyze system performance and recommend optimizations for scalability and reliability.
  • Support capacity planning efforts by monitoring system metrics, traffic patterns, and usage trends to predict future resource needs.
  • Collaborate with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the start.
  • Contribute to architectural decisions, ensuring alignment with best practices in fault tolerance, redundancy, and recovery.
  • Build and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end users.
  • Optimize existing monitoring tools (e.g., Prometheus, Grafana, Datadog, Dynatrace) and build custom dashboards for better visibility into system health.
  • Ensure systems and infrastructure are secure, compliant, and aligned with organizational policies and industry best practices.
  • Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.
  • Lead efforts to continuously improve operational processes, tools, and workflows.
  • Implement and enforce best practices in deployment, monitoring, and incident management to improve overall system reliability and reduce downtime.

Preferred Qualifications

  • Strong experience with Linux/Unix administration and proficiency in scripting (e.g., Python, Bash, Go).
  • Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (EC2, S3, Lambda, Kubernetes, etc.).
  • Experience with containerization and orchestration technologies like Docker and Kubernetes.
  • Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance.
  • Proficiency with monitoring and observability tools such as Dynatrace, Prometheus, Grafana, Datadog, ELK Stack, or similar platforms.
  • Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
  • Experience with CI/CD tools (Jenkins, GitLab CI, CircleCI) and infrastructure automation (Terraform, Ansible, Puppet).
  • Familiarity with distributed systems and microservices architecture.
  • Excellent problem-solving and troubleshooting skills, especially in diagnosing production issues in high-scale environments.
  • Experience with Microsoft Office.
  • Experience working in a multi-platform environment.
  • Ability to balance both development and support roles.
  • Experience working on projects that involve business segments.
  • Strong analytical and troubleshooting skills, and excellent communication skills.
  • Strong interpersonal skills, a focus on customer service, and the ability to work well with other IT, vendor, and business groups.