Platform Reliability Engineer

Company: Hearst
Location: Dallas, TX, USA
Salary: Not provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Mid Level, Senior

Requirements

  • Bachelor’s degree in Computer Science, Systems Engineering, Math, or a related field required (equivalent experience considered).
  • 3+ years of experience in a 24×7 production enterprise-class environment as an SRE or in a comparable role.
  • 1+ years of Kubernetes administration/support in a production environment.
  • 1+ years of Azure or comparable cloud PaaS, IaaS, and resource administration/support in a production environment.
  • Demonstrated composure and effectiveness in situations requiring rapid analysis, clear prioritization, and decisive action – particularly in incidents with significant business or customer impact.
  • Excellent problem-solving and analytical skills, with attention to detail and a record of driving issues to resolution.
  • Experience solving problems via automation using orchestration platforms such as Ansible, Azure Automation, and ServiceNow Flows.
  • Proficient with scripting languages (multiple preferred): Bash, PowerShell, Python, and JavaScript.
  • Proficient with data-tier languages: T-SQL.
  • Proficient with the following monitoring solutions (multiple preferred): Splunk, Prometheus/Grafana, ThousandEyes, Application Insights, Azure Monitor, and Microsoft SCOM.
  • Proficient with modern SRE and observability concepts (e.g., OpenTelemetry, service level management).

Responsibilities

  • Deliver solutions that enhance the overall reliability of the platform and/or reduce toil.
  • Establish modern observability patterns and implement those patterns.
  • Monitor the overall platform health as well as manage overall uptime and availability.
  • Operationalize services, including system testing, instrumentation, monitoring, capacity model development, training, and transition to operations teams.
  • Manage deployments of major releases.
  • Lead and coordinate resolution efforts during major incidents by serving as the incident commander.
  • Participate in an equitable 24×7 on-call rotation, serving as first responder for production alerts and as an escalation point for other teams.

Preferred Qualifications

  • Academic coursework in Algorithms, Data Structures, Distributed Systems, and Information Security.
  • 1+ years serving as incident commander for major incidents.
  • Proficient with networking and troubleshooting (e.g., addressing, routing, DNS, load balancing, mesh networking).
  • Ability to debug and optimize infrastructure as code pipelines using Ansible, Terraform, and Azure ARM.
  • Proficient with ITSM/ITIL practices such as service management, change management, incident management, and problem management, particularly in ServiceNow.
  • Experience designing large-scale distributed systems.
  • Experience designing and developing software oriented towards systems or network automation.
  • Proficient with administration, automation, and orchestration of large-scale Windows and Linux environments using configuration management solutions such as DSC and Ansible.
  • Experience operating in large SQL databases with complex business logic.
  • Experience using ML/AI technologies to accelerate your work.
  • Experience with healthcare industry HIPAA regulations (experience in a similarly regulated industry considered, e.g., PCI, SOX).
  • Experience working in an Agile and/or SAFe environment.