Senior Manager – Site Reliability Engineering – SRE – Digital Banking

Company: Bank of Montreal
Location: Toronto, ON, Canada
Salary: $94,600 – $176,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • Hands-on troubleshooting skills in complex, distributed, or high-availability technical environments.
  • Experience in observability, monitoring, and incident management for critical platforms.
  • Demonstrated leadership in technical settings, which may include leading projects or initiatives or mentoring teams, even without prior experience as a formal people manager.
  • Strong ability to provide oversight and strategic direction for reporting and analytics frameworks, ensuring alignment with organizational goals.
  • Excellent communicator, able to translate technical detail for both engineers and executives.

Responsibilities

  • Provide strategic oversight for incident resolution efforts led by the SRE team, ensuring rapid restoration and comprehensive root cause analysis (RCA).
  • Collaborate across engineering, platform, and security teams to troubleshoot issues spanning full-stack environments (cloud, container, and legacy platforms).
  • Maintain high availability and performance of digital banking applications (primarily AWS, OpenShift, Linux, with some legacy WebSphere).
  • Champion proactive monitoring, observability, and alerting (e.g., Dynatrace, OpenSearch).
  • Define and implement best practices for reliability, scalability, and availability tailored to large-scale digital banking.
  • Continuously improve CI/CD pipelines, release automation, and deployment practices.
  • Drive rigorous postmortem analysis and a culture of blameless continuous improvement.
  • Optimize for scalability, redundancy, and resilience—minimizing customer impact from incidents.
  • Oversee patching and maintenance for cloud and on-prem environments (AWS, OpenShift, Red Hat VMs, some WebSphere).
  • Ensure zero-downtime patching strategies and automation to mitigate operational risk and security vulnerabilities.
  • Partner with security teams to enforce compliance, harden platforms, and remediate vulnerabilities.
  • Provide strategic direction and oversight for reporting frameworks and analytics capabilities, ensuring actionable insights into platform reliability and operational performance.
  • Collaborate with teams to refine dashboards, metrics, and reporting tools that provide clear visibility for stakeholders and leadership.
  • Drive initiatives to improve data accuracy and alignment with organizational goals, ensuring reporting supports decision-making and strategic priorities.
  • Lead, mentor, and grow a high-performing team of 8–10 SREs.
  • Drive a culture of ownership, operational excellence, and continuous learning.
  • Establish and enforce best practices for incident management, operational documentation, and process automation.
  • Collaborate with development, infrastructure, and product teams to enhance observability, deployment, and proactive issue detection.

Preferred Qualifications

  • Intermediate level of proficiency in DevOps.
  • Intermediate level of proficiency in Cybersecurity and privacy concepts, principles and solutions.
  • Intermediate level of proficiency in Emotional agility.
  • Advanced level of proficiency in IT Infrastructure Library (ITIL).
  • Advanced level of proficiency in Robotic Process Automation.
  • Advanced level of proficiency in Cloud Computing.
  • Advanced level of proficiency in Configuration Management.
  • Advanced level of proficiency in Container Orchestration.
  • Advanced level of proficiency in System Design and Implementation.
  • Advanced level of proficiency in Incident management.
  • Advanced level of proficiency in Learning Agility.
  • Advanced level of proficiency in Building and managing relationships.
  • Advanced level of proficiency in API Management.
  • Advanced level of proficiency in Automation and Automation Pipelines.
  • Advanced level of proficiency in Automated Testing.
  • Advanced level of proficiency in Quality Assurance and Control.
  • Advanced level of proficiency in Verbal & written communication skills.
  • Advanced level of proficiency in Analytical and problem-solving skills.
  • Advanced level of proficiency in Collaboration & team skills, with a focus on cross-group collaboration.
  • Advanced level of proficiency in Managing ambiguity.
  • Advanced level of proficiency in Data-driven decision making.
  • Typically 7+ years of relevant experience and a post-secondary degree in a related field of study, or an equivalent combination of education and experience.