Senior Site Reliability Engineer ensuring platform reliability and performance through best practices and automation. Collaborating with engineering teams to drive incident excellence and build observability.
Responsibilities
Act as a first responder during incidents; lead root cause analysis and blameless post-mortems.
Provide input and guidance to squads on troubleshooting documentation and operational runbooks, ensuring they are practical and effective for production support.
Define, implement, and iterate on SLIs, SLOs, and error budgets to drive data-informed reliability decisions.
Identify and measure operational toil; build software and automation to systematically reduce it.
Conduct capacity planning and performance analysis to stay ahead of scaling challenges.
Design and evolve observability platforms (metrics, logs, traces, dashboards) that give engineering teams genuine insight into system behaviour — not just noise.
Continuously improve alert quality: reduce false positives, increase signal, and ensure every alert is actionable.
Partner with development teams to embed reliability thinking into the software delivery lifecycle — from design reviews to deployment strategies.
Champion practices like chaos engineering, progressive rollouts, and failure injection testing.
Mentor engineers across teams on reliability principles and operational best practices.
Join on-call rotations and continuously improve the on-call experience for yourself and others.
Requirements
Fluent English, ideally at a native level.
Education: Bachelor's or Master's degree in Computer Science or Engineering, or equivalent practical experience.
Experience building or significantly evolving observability and monitoring solutions (we use Prometheus, Grafana, and ELK, but we care more about your approach than your tool familiarity).
Experience with AWS.
Linux systems administration background (RHEL/CentOS).
Hands-on experience operating services on container orchestration platforms (Kubernetes preferred).
A track record of improving the reliability of production systems at scale — through better automation, observability, and process, not just firefighting.
Strong communication skills and the ability to influence engineering culture across teams.
An analytical, systems-thinking mindset: you instinctively ask "why did this fail?" and "how do we make sure it can't fail again?"
Nice to Have
Infrastructure-as-code and configuration management experience (Terraform, Ansible).
Strong scripting and automation skills (Bash, Python, or Go) — you're comfortable writing the glue that keeps systems healthy and eliminates repetitive work.