Lead Site Reliability Engineer

Posted 4 hours ago

About the role

  • Lead Site Reliability Engineer responsible for scalable, resilient services on Movable Ink's high-volume content platform. Design and drive automation strategies while mentoring engineering teams.

Responsibilities

  • Define and drive the automation strategy for infrastructure tooling, establishing standards that minimize manual work, increase performance, and reduce the frequency and severity of incidents
  • Own the design, reliability and evolution of core platform applications, mentoring team members on best practices and ensuring systems meet long-term business objectives
  • Architect and lead the logging platform strategy, driving its design and balancing availability, retention and cost optimization
  • Establish capacity planning and performance management frameworks, proactively identifying scaling opportunities and guiding teams through complex troubleshooting scenarios
  • Lead cross-functional reliability initiatives with SRE and service engineering teams, influencing architectural decisions and championing practices that ensure resilient service delivery
  • Demonstrate a high level of autonomy in anticipating, identifying, and addressing systemic weaknesses and opportunities for platform improvement without direct supervision.

Requirements

  • Proven track record in Site Reliability or Software Engineering, designing, building, and owning scalable, resilient services with a focus on long-term reliability strategy
  • Deep expertise in architecting and operating complex distributed systems such as Apache Pulsar, Apache Kafka, Grafana Loki, and ScyllaDB/Cassandra, with the ability to guide teams through distributed system challenges
  • Experience designing and owning automation strategies to manage services at scale, with expertise in establishing performance analysis frameworks and mentoring others on diagnostics and resolution
  • Deep, hands-on experience (6+ years) in Site Reliability or Software Engineering, specifically leading and shaping multi-cloud architecture and strategy (AWS and GCP).
  • Experience architecting and leading large-scale observability platforms, including defining observability standards and SLO frameworks. We use Prometheus and Thanos with Grafana Alloy, Loki and Tempo
  • Experience leading on-call excellence, including driving improvements to monitoring and alerting strategies, automating runbooks, and mentoring team members on incident response best practices. Every member of the SRE team does a week-long on-call rotation
  • Expert-level proficiency with infrastructure as code, including defining IaC standards and patterns across teams. We use Terraform and Chef
  • Advanced Kubernetes expertise, including cluster architecture design, multi-tenancy strategies, and guiding teams on container orchestration best practices. We use EKS and GKE
  • Proficiency in multiple programming languages with the ability to design and review code that meets reliability standards. We use NodeJS, Golang, Ruby, Python and shell scripting
  • Advanced Linux systems expertise, with the ability to diagnose complex system-level issues and mentor others on performance tuning and troubleshooting.

Benefits

  • Full range of medical, financial, and other benefits

Job type

Full Time

Experience level

Senior

Salary

CA$154,000 - CA$200,000 per year

Degree requirement

Bachelor's Degree

Tech skills

Apache, AWS, Cassandra, Chef, Cloud, Distributed Systems, Google Cloud Platform, Grafana, Kafka, Kubernetes, Linux, Node.js, Prometheus, Pulsar, Python, Ruby, Shell Scripting, Terraform, Go

Location requirements

Remote, Canada
