Site Reliability Engineer III

Posted 4 days ago

About the role

  • As a Site Reliability Engineer, you will ensure the high availability, scalability, and performance of Emburse’s systems, collaborating on distributed systems while mentoring junior engineers.

Responsibilities

  • Proactively identify, evaluate, and implement preventative measures to reduce customer impact.
  • Ensure all services are designed and operated with 24/7 availability, scalability, and resilience in mind.
  • Monitor, troubleshoot, and provide visibility to improve site latency, performance, and uptime.
  • Design, develop, and automate reliable cloud infrastructure and platform services.
  • Apply Infrastructure-as-Code (IaC) principles to manage large-scale distributed systems.
  • Write and maintain scripts, tools, and automation frameworks to support operational efficiency.
  • Partner with engineering leadership to develop solutions that improve developer productivity and remove cross-functional dependencies.
  • Collaborate with Platform Engineering teams on project definitions, requirements, backlog grooming, and planning processes.
  • Align operational goals with product and engineering roadmaps to ensure reliability requirements are met early in the lifecycle.
  • Define non-functional requirements (NFRs) and influence standards for scalability, observability, and fault tolerance.
  • Lead cross-functional troubleshooting of complex issues spanning applications, infrastructure, databases, and networks.
  • Serve as a technical mentor to SRE I and II engineers, guiding them in best practices for reliability, automation, and incident management.
  • Lead root cause analysis and postmortem reviews, driving continuous improvement initiatives.
  • Support offshore and distributed teams, promoting effective collaboration and communication.
  • Participate in design and architecture reviews, offering technical recommendations and documentation for key stakeholders.

Requirements

  • Bachelor’s degree in Computer Science or another STEM field.
  • Minimum 6 years of experience in an engineering or operations role with a focus on reliability, scalability, and automation.
  • Preferred: Certified Kubernetes Administrator (CKA) and/or AWS Certification.
  • Strong proficiency in Linux-based distributed environments (up to 70% hands-on work).
  • Deep experience with cloud platforms (AWS or Azure) and Infrastructure-as-Code (Terraform).
  • Excellent scripting skills (Python, Bash, PowerShell); object-oriented programming experience is a plus.
  • Demonstrated ability to develop and maintain internal tools and automation solutions.
  • Excellent written and verbal communication skills in English.
  • Strong project management and organizational abilities with a bias for action.
  • Experience collaborating with offshore or globally distributed teams.
  • Expertise in containerization and orchestration technologies (Docker, Kubernetes).
  • Experience with Kubernetes scaling tooling (Karpenter, KEDA).
  • Strong understanding of DevOps principles and modern CI/CD pipelines.
  • Experience with observability stacks (Prometheus, Grafana, OpenTelemetry).
  • Familiarity with self-healing systems and site reliability best practices.
  • Background in SaaS environments or large-scale distributed applications.
  • Analytical thinker with a focus on root-cause problem solving.
  • Self-starter with a strong ownership mentality and accountability.
  • Mentor and collaborator who uplifts teams and promotes a learning culture.
  • Committed to operational excellence and continuous improvement.

Benefits

  • Competitive pay
  • Flexible work
  • Inclusive, collaborative environment that supports your success

Job type

Full Time

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

Bachelor's Degree

Tech skills

AWS, Azure, Cloud, Distributed Systems, Docker, Grafana, Kubernetes, Linux, Prometheus, Python, Terraform

Location requirements

Hybrid, Toronto, Canada
