Senior Site Reliability Engineer

Company	Credit Acceptance
Location	Southfield, MI, USA
Salary	$117,963 – $173,012
Type	Full-Time
Degrees	Bachelor’s, Master’s
Experience Level	Senior

Requirements

Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
Proven experience as a Site Reliability Engineer or in a similar role.
Proficiency in distributed systems and modern observability practices (e.g., OpenTelemetry, Prometheus), with strong cross-functional collaboration and knowledge-sharing skills.
Experience implementing and maintaining distributed systems using modern architectural patterns.
In-depth knowledge of system architecture, distributed systems, and networking.
Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
Familiarity with continuous integration and continuous deployment (CI/CD) practices.
Excellent troubleshooting and problem-solving skills.
Strong communication and collaboration skills.
Expertise in designing and implementing resilience patterns for distributed systems and microservices architectures, such as circuit breakers and retries. Proficiency in applying modern resilience frameworks to address diverse failure scenarios.
Ability to identify and address gaps in observability, scalability, and fault tolerance prior to deployment, ensuring systems meet reliability and performance standards throughout the SDLC.
Ability to develop efficient, testable, and maintainable solutions using industry best practices to enhance reliability and automate operational tasks.
Ability to design resilient, scalable, and cost-effective systems while evaluating the broader impact of changes on the technical ecosystem.

Responsibilities

Collaborate with software engineers, architects, and operations teams to design highly reliable and scalable systems.
Evaluate existing systems and propose improvements to enhance reliability, performance, and availability.
Drive modernization initiatives, including implementing OpenTelemetry collectors and transitioning to structured logging for improved observability and cost efficiency.
Develop and implement code to automate operational processes and tasks to improve system reliability and performance.
Create self-service tools, such as observability dashboards and automated incident analysis solutions, enabling teams to detect and resolve issues faster.
Build and maintain scripts, pipelines, and tools for monitoring, logging, and alerting, aligned with Golden Path initiatives.
Implement and manage monitoring solutions to proactively identify and address reliability issues.
Participate in on-call rotations and respond promptly to incidents to minimize downtime and improve Mean Time to Restore (MTTR).
Define and implement standardized logging schemas for improved debugging efficiency and cost optimization.
Lead efforts to adopt OpenTelemetry (OTel) for distributed tracing, metrics, and logs, enabling better observability and scalability.
Conduct performance analysis to identify bottlenecks and optimize system performance.
Partner with development teams to address performance issues in the codebase and ensure systems are resilient under load.
Collaborate with capacity planning teams to ensure systems can handle anticipated growth and demand.
Proactively identify capacity-related challenges and propose solutions.
Maintain comprehensive documentation for system configurations, processes, and procedures to ensure operational transparency.
Contribute to knowledge sharing within the SRE team and across departments by creating best practice guides and conducting training sessions.

Preferred Qualifications

Certification in relevant areas (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator) is a plus.