Site Reliability Engineer - SRE

Company	Huntington Bancshares
Location	Columbus, OH, USA
Salary	Not Provided
Type	Full-Time
Degrees	Bachelor’s
Experience Level	Mid Level, Senior

Requirements

Bachelor’s degree in Computer Science or Information Technology.
3+ years of experience in site reliability engineering, DevOps, systems administration, or related roles.

Responsibilities

Lead troubleshooting efforts for high-impact production issues, providing detailed root cause analysis (RCA) and preventative measures.
Participate in on-call rotations, acting as an escalation point for Level 1 SREs during major incidents.
Develop and maintain automation scripts and infrastructure using tools like Terraform, Ansible, or CloudFormation.
Implement automation solutions to eliminate manual tasks and improve system reliability, scalability, and performance.
Analyze system performance and recommend optimizations for scalability and reliability.
Support capacity planning efforts by monitoring system metrics, traffic patterns, and usage trends to predict future resource needs.
Collaborate with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the start.
Contribute to architectural decisions, ensuring alignment with best practices in fault tolerance, redundancy, and recovery.
Build and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end users.
Optimize existing monitoring tools (e.g., Prometheus, Grafana, Datadog, Dynatrace) and build custom dashboards for better visibility into system health.
Ensure systems and infrastructure are secure, compliant, and aligned with organizational policies and industry best practices.
Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.
Lead efforts to continuously improve operational processes, tools, and workflows.
Implement and enforce best practices in deployment, monitoring, and incident management to improve overall system reliability and reduce downtime.

Preferred Qualifications

Strong experience with Linux/Unix administration and proficiency in scripting (e.g., Python, Bash, Go).
Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (EC2, S3, Lambda, Kubernetes, etc.).
Experience with containerization and orchestration technologies like Docker and Kubernetes.
Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance.
Proficiency with monitoring and observability tools such as Dynatrace, Prometheus, Grafana, Datadog, ELK Stack, or similar platforms.
Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
Experience with CI/CD tools (Jenkins, GitLab CI, CircleCI) and infrastructure automation (Terraform, Ansible, Puppet).
Familiarity with distributed systems and microservices architecture.
Excellent problem-solving and troubleshooting skills, especially in diagnosing production issues in high-scale environments.
Microsoft Office experience.
Experience working in multi-platform environments.
Ability to balance both development and support roles.
Experience working on projects that involve business segments.
Strong analytical and troubleshooting skills and excellent communication skills.
Strong interpersonal skills, a focus on customer service, and the ability to work well with other IT, vendor, and business groups.