Site Reliability Engineer

Company: MetroStar
Location: Bedford, MA, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Mid Level, Senior

Requirements

  • Active U.S. Government Secret security clearance or higher.
  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Minimum of 3 years of professional experience in a Site Reliability Engineering role or similar capacity.
  • Strong experience with cloud technologies (e.g., AWS, Azure, GCP) and infrastructure as code (e.g., Terraform, Ansible).
  • Proficiency in managing, leading, and engineering incident and outage response.
  • Strong engineering experience with network protocols and services (e.g., TCP/IP, DNS, HTTP/HTTPS, load balancing).
  • Proficiency in programming and scripting languages (e.g., Python, Go, Bash) and RPA tools (e.g., Blue Prism, UiPath) to automate tasks and develop tools.
  • Deep understanding of containerization and orchestration technologies (e.g., Kubernetes, Docker).
  • Expertise in implementing and managing monitoring and logging solutions (e.g., Splunk, Prometheus, Grafana, ELK stack).
  • Familiarity with CI/CD pipeline development and management (e.g., GitLab CI, Azure DevOps, AWS Lambda, Jenkins).
  • Proven track record of designing, building, and maintaining highly available and scalable systems.
  • Expertise in developing automated functional, regression, and performance tests, and in defining automated testing standards for development teams.
  • Experience facilitating change and configuration management processes to drive reliability.
  • Strong problem-solving skills, with the ability to diagnose complex issues and implement effective solutions.
  • Excellent communication skills, with the ability to collaborate effectively across diverse teams.

Responsibilities

  • Collaborate with cross-functional teams to identify performance bottlenecks, troubleshoot complex issues, and optimize system performance to meet defined service level objectives.
  • Design and implement monitoring, alerting, and incident response strategies to proactively identify and mitigate potential issues, ensuring uninterrupted service availability.
  • Drive automation initiatives to streamline deployment, configuration management, and infrastructure provisioning processes.
  • Develop and maintain comprehensive documentation for system configurations, processes, and procedures.
  • Participate in on-call rotations and respond to incidents, working diligently to resolve issues and prevent recurrence.

Preferred Qualifications

    No preferred qualifications provided.