Reliability Engineer
Company | Hartford Financial Services |
---|---|
Location | Chicago, IL, USA; Charlotte, NC, USA; Columbus, OH, USA; Hartford, CT, USA |
Salary | $90,320 – $135,480 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Mid Level, Senior |
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 3+ years of experience in Infrastructure Engineering, Site Reliability Engineering (SRE), or DevOps.
- Hands-on experience with observability tools: Splunk, Dynatrace, CloudWatch.
- Deep knowledge of Infrastructure as Code (IaC) with Terraform and CloudFormation.
- Proven ability to optimize CI/CD pipelines, automate deployments, and enforce DevSecOps best practices.
- Expertise in cloud platforms (AWS) and Kubernetes-based microservices environments.
- Strong proficiency in Python and Java for infrastructure automation and tooling development.
- Demonstrated experience working within Agile frameworks and methodologies.
- Excellent analytical, problem-solving, and interpersonal skills.
Responsibilities
- Assist in applying best-in-class software engineering standards and design practices for instrumenting code and the application technology stack, so that it generates relevant metrics on overall technology health: availability, performance, quality, currency, and resiliency.
- Assist the architecture and software engineering teams to influence the technical strategy for the organization, keeping in mind its cross-functional impacts, integration across the organization, and architecture rationalization.
- Serve on a team as a technical leader for the supported applications, a role that requires depth and breadth of knowledge in technologies, applications, integration, interfaces, and the business domain.
- Assist in developing effective tooling, alerts, and response mechanisms to identify and address reliability risks, leveraging automation to support problem prevention, detection, mitigation, and resolution.
- Assist in enhancing the delivery flow by engineering the appropriate solutions to increase delivery speed while adhering to technology standards for sustained reliability.
- Partner to implement preventative controls and drive increased automation and self-healing capabilities. Continue to improve cost efficiency baselines.
- Promote and implement innovative solutions.
- Ensure operational excellence. Collaborate to drive the triaging and service restoration of all high-impact incidents to minimize mean time to service restoration and business impact. Demonstrate end-to-end ownership.
- Partner with infrastructure teams to design and implement intelligent incident routing, enhanced monitoring and alerting capabilities, and automated service restoration processes. Take proactive measures to prevent high-impact incidents.
- Achieve and maintain the continuity of Hartford and third-party assets that support a business function. Be accountable for keeping the IT application and infrastructure metadata repositories current.
- Research and implement AI-based anomaly detection to predict infrastructure failures and automate preventive measures.
- Develop AI-powered troubleshooting copilots and LLM-driven operational assistants to accelerate incident resolution and root cause analysis.
- Implement AI/ML-based runbooks to automate system recovery and optimize operational efficiency.
Preferred Qualifications
- Experience with AI/ML frameworks for observability, predictive failure detection, and AI-driven troubleshooting.
- Experience with Oracle and SQL Server relational database technologies. Knowledge of open-source database technologies is beneficial.