Senior Site Reliability Engineer
Company | Credit Acceptance Careers |
---|---|
Location | Southfield, MI, USA |
Salary | $117,963 – $173,012 |
Type | Full-Time |
Degrees | Bachelor’s, Master’s |
Experience Level | Senior |
Requirements
- Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
- Proven experience as a Site Reliability Engineer or similar role.
- Proficiency in distributed systems and modern observability practices (e.g., OpenTelemetry, Prometheus), with strong cross-functional collaboration and knowledge-sharing skills.
- Experience implementing and maintaining distributed systems using modern architectural patterns.
- In-depth knowledge of system architecture, distributed systems, and networking.
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
- Familiarity with continuous integration and continuous deployment (CI/CD) practices.
- Excellent troubleshooting and problem-solving skills.
- Strong communication and collaboration skills.
- Expertise in designing and implementing resilience patterns, such as circuit breakers and retries, for distributed systems and microservices architectures. Proficiency in applying modern resiliency frameworks to address diverse failure scenarios.
- Ability to identify and address gaps in observability, scalability, and fault tolerance prior to deployment, ensuring systems meet reliability and performance standards throughout the SDLC.
- Ability to develop efficient, testable, and maintainable solutions that follow industry best practices, enhance reliability, and automate operational tasks.
- Ability to design resilient, scalable, and cost-effective systems while evaluating the broader impact of changes on the technical ecosystem.
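
For candidates unfamiliar with the resilience patterns named above, the following is a minimal Python sketch of a retry with exponential backoff combined with a circuit breaker. It is illustrative only and does not describe any framework used at the company; the class and function names, thresholds, and timeouts are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected outright."""


class CircuitBreaker:
    """Hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call rejected")
            # Timeout elapsed: half-open, so one trial call is allowed through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def retry(func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call func, retrying failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A caller would typically wrap the breaker inside the retry, for example `retry(lambda: breaker.call(fetch))`, so that retries stop as soon as the breaker opens.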
Responsibilities
- Collaborate with software engineers, architects, and operations teams to design highly reliable and scalable systems.
- Evaluate existing systems and propose improvements to enhance reliability, performance, and availability.
- Drive modernization initiatives, including implementing OpenTelemetry collectors and transitioning to structured logging for improved observability and cost efficiency.
- Develop and implement code to automate operational processes and tasks to improve system reliability and performance.
- Create self-service tools, such as observability dashboards and automated incident analysis solutions, enabling teams to detect and resolve issues faster.
- Build and maintain scripts, pipelines, and tools for monitoring, logging, and alerting, aligned with Golden Path initiatives.
- Implement and manage monitoring solutions to proactively identify and address reliability issues.
- Participate in on-call rotations and respond promptly to incidents to minimize downtime and improve Mean Time to Restore (MTTR).
- Define and implement standardized logging schemas for improved debugging efficiency and cost optimization.
- Lead efforts to adopt OpenTelemetry (OTel) for distributed tracing, metrics, and logs, enabling better observability and scalability.
- Conduct performance analysis to identify bottlenecks and optimize system performance.
- Partner with development teams to address performance issues in the codebase and ensure systems are resilient under load.
- Collaborate with capacity planning teams to ensure systems can handle anticipated growth and demand.
- Proactively identify capacity-related challenges and propose solutions.
- Maintain comprehensive documentation for system configurations, processes, and procedures to ensure operational transparency.
- Contribute to knowledge sharing within the SRE team and across departments by creating best practice guides and conducting training sessions.
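
As an illustration of the structured logging and standardized schema work described above, here is a minimal sketch using only Python's standard `logging` and `json` modules. The schema fields (`service`, `trace_id`) and the logger name are assumptions chosen for the example, not the company's actual schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of keys."""

    # Hypothetical schema fields; a real schema would be agreed across teams.
    FIELDS = ("service", "trace_id")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Always emit every schema key so downstream queries can rely on it.
        for field in self.FIELDS:
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("example")  # hypothetical logger name
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Extra fields are attached to the record and picked up by the formatter.
    logger.info("payment processed", extra={"service": "billing", "trace_id": "abc123"})
```

Emitting every schema key on every line, even when the value is null, keeps log queries and dashboards uniform across services.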
Preferred Qualifications
- Certification in relevant areas (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator) is a plus.