Senior Site Reliability Engineer

Company: Credit Acceptance Careers
Location: Southfield, MI, USA
Salary: $117,963 – $173,012
Type: Full-Time
Degrees: Bachelor’s, Master’s
Experience Level: Senior

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Information Technology, or a related field.
  • Proven experience as a Site Reliability Engineer or similar role.
  • Proficiency in distributed systems and modern observability practices (e.g., OpenTelemetry, Prometheus), with strong cross-functional collaboration and knowledge-sharing skills.
  • Experience implementing and maintaining distributed systems using modern architectural patterns.
  • In-depth knowledge of system architecture, distributed systems, and networking.
  • Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
  • Familiarity with continuous integration and continuous deployment (CI/CD) practices.
  • Excellent troubleshooting and problem-solving skills.
  • Strong communication and collaboration skills.
  • Expertise in designing and implementing resilience patterns for distributed systems and microservices architectures, such as circuit breakers and retries, and proficiency in applying modern resiliency frameworks to address diverse failure scenarios.
  • Ability to identify and address gaps in observability, scalability, and fault tolerance prior to deployment, ensuring systems meet reliability and performance standards throughout the SDLC.
  • Ability to develop efficient, testable, and maintainable solutions using industry best practices to enhance reliability and automate operational tasks.
  • Ability to design resilient, scalable, and cost-effective systems while evaluating the broader impact of changes on the technical ecosystem.

Responsibilities

  • Collaborate with software engineers, architects, and operations teams to design highly reliable and scalable systems.
  • Evaluate existing systems and propose improvements to enhance reliability, performance, and availability.
  • Drive modernization initiatives, including implementing OpenTelemetry collectors and transitioning to structured logging for improved observability and cost efficiency.
  • Develop and implement code to automate operational processes and tasks to improve system reliability and performance.
  • Create self-service tools, such as observability dashboards and automated incident analysis solutions, enabling teams to detect and resolve issues faster.
  • Build and maintain scripts, pipelines, and tools for monitoring, logging, and alerting, aligned with Golden Path initiatives.
  • Implement and manage monitoring solutions to proactively identify and address reliability issues.
  • Participate in on-call rotations and respond promptly to incidents to minimize downtime and improve Mean Time to Restore (MTTR).
  • Define and implement standardized logging schemas for improved debugging efficiency and cost optimization.
  • Lead efforts to adopt OpenTelemetry (OTel) for distributed tracing, metrics, and logs, enabling better observability and scalability.
  • Conduct performance analysis to identify bottlenecks and optimize system performance.
  • Partner with development teams to address performance issues in the codebase and ensure systems are resilient under load.
  • Collaborate with capacity planning teams to ensure systems can handle anticipated growth and demand.
  • Proactively identify capacity-related challenges and propose solutions.
  • Maintain comprehensive documentation for system configurations, processes, and procedures to ensure operational transparency.
  • Contribute to knowledge sharing within the SRE team and across departments by creating best practice guides and conducting training sessions.

Preferred Qualifications

  • Certification in relevant areas (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator) is a plus.