Lead Site Reliability Engineer ensuring scalable, resilient services for Movable Ink's high-volume content platform. Design and drive automation strategies while mentoring engineering teams.
Responsibilities
Define and drive the automation strategy for infrastructure tooling, establishing standards that minimize manual work, improve performance and reduce the frequency and severity of incidents
Own the design, reliability and evolution of core platform applications, mentoring team members on best practices and ensuring systems meet long-term business objectives
Architect and lead the logging platform strategy, driving its design and balancing availability, retention and cost optimization
Establish capacity planning and performance management frameworks, proactively identifying scaling opportunities and guiding teams through complex troubleshooting scenarios
Lead cross-functional reliability initiatives with SRE and service engineering teams, influencing architectural decisions and championing practices that ensure resilient service delivery
Demonstrate a high level of autonomy in anticipating, identifying, and addressing systemic weaknesses and opportunities for platform improvement without direct supervision.
Requirements
Proven track record in Site Reliability or Software Engineering, designing, building, and owning scalable, resilient services with a focus on long-term reliability strategy
Deep expertise in architecting and operating complex distributed systems such as Apache Pulsar, Apache Kafka, Grafana Loki and ScyllaDB/Cassandra, with the ability to guide teams through distributed system challenges
Experience designing and owning automation strategies to manage services at scale, with expertise in establishing performance analysis frameworks and mentoring others on diagnostics and resolution
Deep, hands-on experience (6+ years) in Site Reliability or Software Engineering, specifically leading and shaping multi-cloud architecture and strategy (AWS and GCP).
Experience architecting and leading large-scale observability platforms, including defining observability standards and SLO frameworks. We use Prometheus and Thanos with Grafana Alloy, Loki and Tempo.
Experience leading on-call excellence, including driving improvements to monitoring and alerting strategies, automating runbooks and mentoring team members on incident response best practices. Every member of the SRE team does a week-long on-call rotation.
Expert-level proficiency with infrastructure as code, including defining IaC standards and patterns across teams. We use Terraform and Chef.
Advanced Kubernetes expertise, including cluster architecture design, multi-tenancy strategies, and guiding teams on container orchestration best practices. We use EKS and GKE.
Proficiency in multiple programming languages with the ability to design and review code that meets reliability standards. We use Node.js, Golang, Ruby, Python and shell scripting.
Advanced Linux systems expertise, with the ability to diagnose complex system-level issues and mentor others on performance tuning and troubleshooting.
DevOps Engineer responsible for infrastructure, CI/CD, and backend services for a major university platform. Join Robots and Pencils, building impactful digital solutions.
DevOps Specialist ensuring efficient DevOps practices within Development and QA teams at Desjardins. Collaborating with developers to optimize IT platforms and enhance deployment strategies.
Cloud/DevOps Intern developing tools for enterprise technology at TD. Working on DevOps processes and gaining hands-on experience in cloud technologies in a hybrid model.
DevOps Specialist creating and overseeing Azure hybrid cloud infrastructures for EVLO's battery energy storage solutions. Collaborating with teams to implement cutting-edge technologies in a dynamic environment.
DevOps Specialist responsible for technical expertise in Java development and AWS automation. Ensuring high-quality software solutions and a reliable infrastructure at Portage CyberTech.
Senior Site Reliability Engineer at Fable ensuring reliable and scalable infrastructure for AI-driven accessible products. Collaborating across teams to improve operational excellence and platform engineering.
Senior DevOps Engineer managing Zipline's cloud infrastructure and CI/CD systems. Collaborating with engineering teams to ensure platform reliability and scalability.
Back-End & DevOps Software Developer contributing to building digital products to change the world. Specializing in back-end development and command of the DevOps ecosystem for robust infrastructure.
Storage Technical Analyst providing global support for RBC's storage and backup infrastructure. Mentoring operations staff and managing automation solutions for advanced incident management.
Infrastructure Engineer/SRE responsible for core infrastructure design and building tools for AI-driven contact center solutions. Join a leading AI company impacting the future of work.