Site Reliability Principal Specialist at Sherweb responsible for enhancing reliability across IT Operations platforms and services. The role implements proactive and scalable approaches to site reliability while influencing technical direction.
Responsibilities
Implement a proactive, resilient, and scalable approach to site reliability across all Sherweb platforms.
Shape how reliability is designed, governed, and sustained across systems.
Elevate reliability from reactive operations to an engineered discipline ensuring platforms operate predictably.
Define and evolve reliability standards across platforms and services, including service level objectives (SLOs) and service level indicators (SLIs).
Establish a shared reliability language and expectations across IT Operations teams.
Drive consistency in monitoring and operational practices across services, systems and platforms.
Influence system and operational design to improve reliability, availability and resilience.
Drive the reduction of operational toil through automation, AI, platform capabilities, and repeatable operational patterns.
Improve end-to-end observability and system understanding.
Enable teams to take end-to-end ownership of platform reliability.
Partner closely with infrastructure and platform teams to ensure access, tooling, and visibility support full operational ownership.
Act as a reliability advocate and technical advisor during operational reviews, incident learning, and platform evolution.
Partner closely with DevOps teams to implement reliability and observability as code, ensuring integration with CI/CD pipelines and platform tooling.
Requirements
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience.
10+ years of experience in Site Reliability Engineering, operating and improving large-scale production environments.
Demonstrated experience improving the reliability, availability, and scalability of production systems, platforms and services.
Hands-on experience operating distributed systems in business-critical and customer-facing environments.
Proven experience reducing manual operational work through automation and standardization.
Experience defining and applying reliability standards (e.g., SLOs, error budgets) across multiple services or platforms.
Demonstrated ability to influence technical direction across multiple teams without direct authority.
Strong understanding of distributed systems, failure modes, and operational resilience.
Solid experience with observability practices (metrics, logs, traces) and system diagnostics.
Ability to analyze complex systems end to end across infrastructure, platform, and application layers.
Strong systems thinking with a track record of addressing reliability issues through design rather than reactive intervention.
Experience acting as a trusted technical advisor to senior engineers and leaders.
Ability to clearly communicate complex reliability concepts to both technical and non-technical stakeholders.
Benefits
A fast-paced work environment that adapts to you
A friendly and diverse work culture with inclusion and equality at the heart of our actions
State-of-the-art technology and tools
A results-oriented culture where talent, action, and thinking outside the box are given due recognition
Annual salary review based on performance
Generous and caring colleagues of various professional and cultural backgrounds
A flexible total compensation offer
Vacation time that considers your previous experience
Advanced paid hours to recharge your batteries (holidays and mobile days)
Flexible benefits plan that adapts to your needs
Flexible savings fund option
A monthly home internet allowance
Considerable growth opportunities
A career path with opportunities to learn and grow
Proximity to your direct manager and open, honest communication to support your development
Multiple initial and on-the-job training opportunities and tools to track your progress and help you scale up in your career
"Sherweblife" - a rich calendar of activities that allow us to gather virtually and face-to-face throughout the year