Platform Reliability Engineer
Company | Hearst |
---|---|
Location | Dallas, TX, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Mid Level, Senior |
Requirements
- Bachelor’s degree in Computer Science, Systems Engineering, Math, or a related field required (equivalent experience considered).
- 3+ years of experience in a 24×7 production enterprise-class environment as an SRE or in a comparable role.
- 1+ years of Kubernetes administration/support in a production environment.
- 1+ years of Azure or comparable cloud PaaS, IaaS, and resource administration/support in a production environment.
- Demonstrated composure and effectiveness in situations requiring rapid analysis, clear prioritization, and decisive action – particularly in incidents with significant business or customer impact.
- Excellent problem-solving and analytical skills, with attention to detail and a record of driving issues to resolution.
- Experience solving problems via automation using orchestration platforms such as Ansible, Azure Automation, and ServiceNow Flows.
- Proficient with scripting languages (multiple preferred): Bash, PowerShell, Python, and JavaScript.
- Proficient with data tier languages: T-SQL.
- Proficient with the following monitoring solutions (multiple preferred): Splunk, Prometheus/Grafana, ThousandEyes, Application Insights, Azure Monitor, and Microsoft SCOM.
- Proficient with modern SRE and observability concepts (e.g., OTEL, service level management).
Responsibilities
- Deliver solutions that enhance the overall reliability of the platform and/or reduce toil.
- Establish modern observability patterns and implement those patterns.
- Monitor the overall platform health as well as manage overall uptime and availability.
- Operationalize services, including system testing, instrumentation, monitoring, capacity model development, training, and transition to operations teams.
- Manage deployments of major releases.
- Lead and coordinate resolution efforts during major incidents by serving as the incident commander.
- Participate in an equitable 24×7 on-call rotation, serving as first responder for production alerts and as an escalation point for other teams.
Preferred Qualifications
- Academic coursework in Algorithms, Data Structures, Distributed Systems, and Information Security.
- 1+ year(s) serving as incident commander for major incidents.
- Proficient with networking and troubleshooting (e.g., addressing, routing, DNS, load balancing, mesh networking).
- Ability to debug and optimize infrastructure as code pipelines using Ansible, Terraform, and Azure ARM.
- Proficient with ITSM/ITIL practices such as service management, change management, incident management, and problem management, particularly in ServiceNow.
- Experience designing large-scale distributed systems.
- Experience designing and developing software oriented towards systems or network automation.
- Proficient with administration, automation, and orchestration of large-scale Windows and Linux environments using configuration management solutions such as DSC and Ansible.
- Experience operating in large SQL databases with complex business logic.
- Experience utilizing ML/AI technologies to accelerate your work.
- Experience with healthcare industry HIPAA regulations (experience in a similarly regulated industry considered, e.g., PCI, SOX).
- Experience working in an Agile and/or SAFe environment.