About the role

Lead Site Reliability Engineer ensuring scalable, resilient services for Movable Ink's high-volume content platform. Design and drive automation strategies while mentoring engineering teams.

Responsibilities

Define and drive the automation strategy for infrastructure tooling, establishing standards that minimize manual work, improve performance and reduce the frequency and severity of incidents
Own the design, reliability and evolution of core platform applications, mentoring team members on best practices and ensuring systems meet long-term business objectives
Architect and lead the logging platform strategy, driving its design and balancing availability, retention and cost optimization
Establish capacity planning and performance management frameworks, proactively identifying scaling opportunities and guiding teams through complex troubleshooting scenarios
Lead cross-functional reliability initiatives with SRE and service engineering teams, influencing architectural decisions and championing practices that ensure resilient service delivery
Demonstrate a high level of autonomy in anticipating, identifying, and addressing systemic weaknesses and opportunities for platform improvement without direct supervision.

Requirements

Proven track record in Site Reliability or Software Engineering, designing, building, and owning scalable, resilient services with a focus on long-term reliability strategy
Deep expertise in architecting and operating complex distributed systems such as Apache Pulsar, Apache Kafka, Grafana Loki and ScyllaDB/Cassandra, with the ability to guide teams through distributed system challenges
Experience designing and owning automation strategies to manage services at scale, with expertise in establishing performance analysis frameworks and mentoring others on diagnostics and resolution
Deep, hands-on experience (6+ years) in Site Reliability or Software Engineering, specifically leading and shaping multi-cloud architecture and strategy (AWS and GCP).
Experience architecting and leading large-scale observability platforms, including defining observability standards and SLO frameworks. We use Prometheus and Thanos with Grafana Alloy, Loki and Tempo
Experience leading on-call excellence, including driving improvements to monitoring and alerting strategies, automating runbooks and mentoring team members on incident response best practices. Every member of the SRE team takes a week-long on-call rotation
Expert-level proficiency with infrastructure as code, including defining IaC standards and patterns across teams. We use Terraform and Chef
Advanced Kubernetes expertise, including cluster architecture design, multi-tenancy strategies, and guiding teams on container orchestration best practices. We use EKS and GKE
Proficiency in multiple programming languages with the ability to design and review code that meets reliability standards. We use Node.js, Go, Ruby, Python and shell scripting
Advanced Linux systems expertise, with the ability to diagnose complex system-level issues and mentor others on performance tuning and troubleshooting.

Benefits

Full range of medical, financial and other benefits
