About the role

As a Site Reliability Engineer, you will focus on ensuring the reliability and scalability of CloudBlue’s SaaS platforms. You will collaborate with global teams to monitor and improve the systems of multi-tenant service providers.

Responsibilities

Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
Reduce operational toil by identifying opportunities for automation and process improvement
Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
Support other tasks or projects as assigned to meet team and business needs

Requirements

3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
Proven experience operating highly available, enterprise-grade, multi-tenant SaaS platforms
Hands-on experience with observability and monitoring tools such as Datadog, Grafana, and Elasticsearch/Kibana
Solid understanding of Linux, networking, and distributed systems fundamentals
Experience working with containerized environments such as Docker and Kubernetes
Strong scripting and automation skills using Python and/or Bash
Experience participating in on-call rotations and incident response in production environments
Strong written and spoken English
Experience defining SLIs/SLOs and managing error budgets at scale will be considered a plus
Cloud experience, preferably with Azure; experience with AWS and/or GCP will also be valued
Experience working with hybrid or on-premises integrations is beneficial
Familiarity with chaos engineering and resilience testing will be considered an asset

Benefits

A competitive salary that values you and your unique skill sets
Career advancement & professional development opportunities to help you reach your full potential
Flexible work arrangements to support work/life balance

Site Reliability Engineer

at HostPapa
