Site Reliability Engineer – SRE
Company | Huntington Bancshares |
---|---|
Location | Columbus, OH, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Mid Level, Senior |
Requirements
- Bachelor’s degree in Computer Science or Information Technology.
- 3+ years of experience in site reliability engineering, DevOps, systems administration, or related roles.
Responsibilities
- Lead troubleshooting efforts for high-impact production issues, providing detailed root cause analysis (RCA) and preventative measures.
- Participate in on-call rotations, acting as an escalation point for Level 1 SREs during major incidents.
- Develop and maintain automation scripts and infrastructure using tools like Terraform, Ansible, or CloudFormation.
- Implement automation solutions to eliminate manual tasks and improve system reliability, scalability, and performance.
- Analyze system performance and recommend optimizations for scalability and reliability.
- Support capacity planning efforts by monitoring system metrics, traffic patterns, and usage trends to predict future resource needs.
- Collaborate with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the start.
- Contribute to architectural decisions, ensuring alignment with best practices in fault tolerance, redundancy, and recovery.
- Build and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end users.
- Optimize existing monitoring tools (e.g., Prometheus, Grafana, Datadog, Dynatrace) and build custom dashboards for better visibility into system health.
- Ensure systems and infrastructure are secure, compliant, and aligned with organizational policies and industry best practices.
- Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services.
- Lead efforts to continuously improve operational processes, tools, and workflows.
- Implement and enforce best practices in deployment, monitoring, and incident management to improve overall system reliability and reduce downtime.
Preferred Qualifications
- Strong experience with Linux/Unix administration and proficiency in scripting (e.g., Python, Bash, Go).
- Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (EC2, S3, Lambda, Kubernetes, etc.).
- Experience with containerization and orchestration technologies like Docker and Kubernetes.
- Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance.
- Proficiency with monitoring and observability tools such as Dynatrace, Prometheus, Grafana, Datadog, ELK Stack, or similar platforms.
- Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs.
- Experience with CI/CD tools (Jenkins, GitLab CI, CircleCI) and infrastructure automation (Terraform, Ansible, Puppet).
- Familiarity with distributed systems and microservices architecture.
- Excellent problem-solving and troubleshooting skills, especially in diagnosing production issues in high-scale environments.
- Microsoft Office experience.
- Experience working in a multi-platform environment.
- Ability to balance both development and support roles.
- Experience working on projects that involve business segments.
- Strong analytical and troubleshooting skills, and excellent communication skills.
- Strong interpersonal skills, a focus on customer service, and the ability to work well with other IT, vendor, and business groups.