Senior Site Reliability Engineer, AI Infrastructure

Posted 2 hours ago

About the role

  • Senior Site Reliability Engineer at PointClickCare focused on AI platform reliability and operational excellence, collaborating with cross-functional teams to ensure secure and efficient service delivery.

Responsibilities

  • Own service level objectives, error budgets, and reliability targets for the infrastructure underpinning cloud-based platforms — ensuring infrastructure observability (metrics, logs, traces), alert quality, and telemetry completeness across platform components and serving endpoints
  • Design, build, and maintain infrastructure-as-code, operational automation, and change control workflows for AI/ML platforms — with a focus on repeatability, consistency, and toil reduction
  • Implement and maintain platform security controls (network segmentation, secrets management, encryption, and data protection safeguards) aligned to compliance requirements, and partner with security teams to respond to emerging risks
  • Lead incident response and blameless postmortems; validate backup/restore and disaster recovery processes; conduct game days and resiliency testing to harden platform and infrastructure reliability
  • Mentor engineers, influence design reviews, and collaborate across engineering teams to improve platform resiliency, cost efficiency, capacity planning, and operational standards

Requirements

Minimum:
  • 5+ years in SRE, platform engineering, or infrastructure roles supporting production cloud environments and mission-critical applications
  • Strong proficiency with observability (metrics, logging, distributed tracing, SLI/SLO frameworks) and production ownership, including incident response, blameless postmortems, and on-call operations
  • Strong proficiency with Infrastructure as Code (Terraform), GitOps practices, and CI/CD for infrastructure and platform changes
  • Working proficiency with cloud platform administration (compute, networking, storage) and with operating managed data or AI/ML platform services in production (e.g., Databricks, Azure ML, or Kubernetes-hosted infrastructure)
  • Working proficiency with platform security: network segmentation, secrets management, encryption at rest and in transit, and key management
  • Strong programming skills for automation, operational tooling, and infrastructure management
  • Strong communication and documentation skills: able to write runbooks, lead postmortems, influence operational standards across teams, and translate technical complexity for diverse audiences

Preferred:
  • Experience with disaster recovery planning, multi-region patterns, and capacity or cost optimization (FinOps)
  • Working knowledge of container orchestration (Kubernetes), progressive delivery patterns (blue/green, canary), and data lineage tooling
  • Experience in healthcare, life sciences, or other highly regulated industries with data privacy requirements

Benefits

  • Benefits starting from Day 1!
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more!

Job type

Full Time

Experience level

Senior

Salary

CA$139,000 - CA$155,000 per year

Degree requirement

Bachelor's Degree

Tech skills

Azure, Cloud, Kubernetes, Terraform

Location requirements

Hybrid, Mississauga, Canada
