Platform Reliability Engineer

Bachelor’s degree in Computer Science, Systems Engineering, Mathematics, or a related field required (equivalent experience considered).
3+ years of experience in a 24×7 production enterprise-class environment as an SRE or in a comparable role.
1+ years of Kubernetes administration/support in a production environment.
1+ years of Azure or comparable cloud PaaS, IaaS, and resource administration/support in a production environment.
Demonstrated composure and effectiveness in situations requiring rapid analysis, clear prioritization, and decisive action – particularly in incidents with significant business or customer impact.
Excellent problem-solving and analytical skills, with attention to detail and the ability to drive issues to resolution.
Experience solving problems via automation using orchestration platforms such as Ansible, Azure Automation, and ServiceNow Flows.
Proficient with scripting languages (multiple preferred): Bash, PowerShell, Python, and JavaScript.
Proficient with data-tier languages: T-SQL.
Proficient with the following monitoring solutions (multiple preferred): Splunk, Prometheus/Grafana, ThousandEyes, Application Insights, Azure Monitor, and Microsoft SCOM.
Proficient with modern SRE and observability concepts (e.g., OTEL, service level management).

Deliver solutions that enhance the overall reliability of the platform and/or reduce toil.
Establish and implement modern observability patterns.
Monitor overall platform health and manage uptime and availability.
Operationalize services, including system testing, instrumentation, monitoring, capacity model development, training, and transition to operations teams.
Manage deployments of major releases.
Lead and coordinate resolution efforts during major incidents by serving as the incident commander.
Participate in an equitable 24×7 on-call rotation, serving as first responder for production alerts and escalation point for other teams.

Academic coursework in Algorithms, Data Structures, Distributed Systems, and Information Security.
1+ years serving as incident commander for major incidents.
Proficient with networking and troubleshooting (e.g., addressing, routing, DNS, load balancing, mesh networking).
Ability to debug and optimize infrastructure-as-code pipelines using Ansible, Terraform, and Azure ARM.
Proficient with ITSM/ITIL practices such as service management, change management, incident management, and problem management, particularly in ServiceNow.
Experience designing large-scale distributed systems.
Experience designing and developing software oriented towards systems or network automation.
Proficient with administration, automation, and orchestration of large-scale Windows and Linux environments using configuration management solutions such as DSC and Ansible.
Experience operating in large SQL databases with complex business logic.
Experience using ML/AI technologies to accelerate your work.
Experience with healthcare industry HIPAA regulations (experience in a similarly regulated industry considered, e.g., PCI, SOX).
Experience working in an Agile and/or SAFe environment.