Senior Manager – Site Reliability Engineering – SRE – Digital Banking
| Field | Details |
|---|---|
| Company | Bank of Montreal |
| Location | Toronto, ON, Canada |
| Salary | $94,600 – $176,000 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Senior, Expert or higher |
Requirements
- Hands-on troubleshooting skills in complex, distributed, or high-availability technical environments.
- Experience in observability, monitoring, and incident management for critical platforms.
- Demonstrated leadership in technical settings, which may include leading projects or initiatives or mentoring teams, even without prior experience as a formal people manager.
- Strong ability to provide oversight and strategic direction for reporting and analytics frameworks, ensuring alignment with organizational goals.
- Excellent communicator, able to translate technical detail for both engineers and executives.
Responsibilities
- Provide strategic oversight for incident resolution efforts led by the SRE team, ensuring rapid restoration and comprehensive root cause analysis (RCA).
- Collaborate across engineering, platform, and security teams to troubleshoot issues spanning full-stack environments (cloud, container, and legacy platforms).
- Maintain high availability and performance of digital banking applications (primarily AWS, OpenShift, Linux, with some legacy WebSphere).
- Champion proactive monitoring, observability, and alerting (e.g., Dynatrace, OpenSearch).
- Define and implement best practices for reliability, scalability, and availability tailored to large-scale digital banking.
- Continuously improve CI/CD pipelines, release automation, and deployment practices.
- Drive rigorous postmortem analysis and a culture of blameless continuous improvement.
- Optimize for scalability, redundancy, and resilience to minimize customer impact from incidents.
- Oversee patching and maintenance for cloud and on-prem environments (AWS, OpenShift, Red Hat VMs, some WebSphere).
- Implement zero-downtime patching strategies and automation to mitigate operational risk and security vulnerabilities.
- Partner with security teams to enforce compliance, harden platforms, and remediate vulnerabilities.
- Provide strategic direction and oversight for reporting frameworks and analytics capabilities, ensuring actionable insights into platform reliability and operational performance.
- Collaborate with teams to refine dashboards, metrics, and reporting tools that provide clear visibility for stakeholders and leadership.
- Drive initiatives to improve data accuracy and alignment with organizational goals, ensuring reporting supports decision-making and strategic priorities.
- Lead, mentor, and grow a high-performing team of 8–10 SREs.
- Drive a culture of ownership, operational excellence, and continuous learning.
- Establish and enforce best practices for incident management, operational documentation, and process automation.
- Collaborate with development, infrastructure, and product teams to enhance observability, deployment, and proactive issue detection.
Preferred Qualifications
- Intermediate level of proficiency in DevOps.
- Intermediate level of proficiency in Cybersecurity and privacy concepts, principles and solutions.
- Intermediate level of proficiency in Emotional agility.
- Advanced level of proficiency in IT Infrastructure Library (ITIL).
- Advanced level of proficiency in Robotic Process Automation.
- Advanced level of proficiency in Cloud Computing.
- Advanced level of proficiency in Configuration Management.
- Advanced level of proficiency in Container Orchestration.
- Advanced level of proficiency in System Design and Implementation.
- Advanced level of proficiency in Incident management.
- Advanced level of proficiency in Learning Agility.
- Advanced level of proficiency in Building and managing relationships.
- Advanced level of proficiency in API Management.
- Advanced level of proficiency in Automation and Automation Pipelines.
- Advanced level of proficiency in Automated Testing.
- Advanced level of proficiency in Quality Assurance and Control.
- Advanced level of proficiency in Verbal and written communication skills.
- Advanced level of proficiency in Analytical and problem-solving skills.
- Advanced level of proficiency in Collaboration and team skills, with a focus on cross-group collaboration.
- Advanced level of proficiency in Managing ambiguity.
- Advanced level of proficiency in Data-driven decision-making.
- Typically 7+ years of relevant experience and a post-secondary degree in a related field of study, or an equivalent combination of education and experience.