Principal Specialist implementing site reliability across Sherweb platforms. Elevating reliability from reactive operations to an engineered discipline.
Responsibilities
Implement a proactive, resilient, and scalable approach to site reliability across all Sherweb platforms
Shape how reliability is designed, governed, and sustained across systems
Elevate reliability from reactive operations to an engineered discipline
Ensure platforms operate predictably as Sherweb grows in scale, complexity, and customer impact
Act as a principal-level technical leader across IT Operations
Set reliability direction and drive consistency through technical authority, influence, and partnership
Serve as a technical counterpart to senior engineering, infrastructure, and platform leaders
Define and evolve reliability standards across platforms and services
Establish a shared reliability language and expectations across IT Operations teams
Drive consistency in monitoring and operational practices across services, systems, and platforms
Influence system and operational design to improve reliability, availability, and resilience
Drive the reduction of operational toil through automation, AI, platform capabilities, and repeatable operational patterns
Improve end-to-end observability and system understanding
Enable teams to take end-to-end ownership of platform reliability
Partner closely with infrastructure and platform teams to ensure access, tooling, and visibility support full operational ownership
Act as a reliability advocate and technical advisor during operational reviews, incident learning, and platform evolution
Partner closely with DevOps teams to implement reliability and observability as code
Requirements
Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience
10+ years of experience in Site Reliability Engineering, operating and improving large-scale production environments
Demonstrated experience improving the reliability, availability, and scalability of production systems, platforms, and services
Hands-on experience operating distributed systems in business-critical and customer-facing environments
Proven experience reducing manual operational work through automation and standardization
Experience defining and applying reliability standards (e.g., SLOs, error budgets) across multiple services or platforms
Demonstrated ability to influence technical direction across multiple teams without direct authority
Strong understanding of distributed systems, failure modes, and operational resilience
Solid experience with observability practices (metrics, logs, traces) and system diagnostics
Ability to analyze complex systems end to end across infrastructure, platform, and application layers
Strong systems thinking with a track record of addressing reliability issues through design rather than reactive intervention
Experience acting as a trusted technical advisor to senior engineers and leaders
Ability to clearly communicate complex reliability concepts to both technical and non-technical stakeholders
Certifications related to reliability, operations, or systems engineering (e.g., Kubernetes, Linux, or observability platforms) are considered an asset. Equivalent demonstrated expertise through hands-on experience is acceptable in lieu of formal certifications.
Benefits
A fast-paced work environment that adapts to you
A friendly and diverse work culture with inclusion and equality at the heart of our actions
State-of-the-art technology and tools
A results-oriented culture where talent, action, and thinking outside the box are given due recognition
Annual salary review based on performance
Generous and caring colleagues of various professional and cultural backgrounds
A flexible total compensation offer
Vacation time that considers your previous experience
Advanced paid hours to recharge your batteries (holidays and mobile days)
Flexible benefits plan that adapts to your needs
Flexible savings fund option
A monthly home internet allowance
Considerable growth opportunities
A career path with opportunities to learn and grow
Proximity to your direct manager and open, honest communication to support your development
Multiple initial and on-the-job training opportunities and tools to track your progress and help you scale up in your career
“Sherweblife” - a rich calendar of activities that allow us to gather virtually and face-to-face throughout the year