Site Reliability Engineer focused on ensuring reliability and scalability of CloudBlue’s SaaS platforms. Collaborating with global teams to monitor and improve multi-tenant service providers' systems.
Responsibilities
Define and implement SLIs, SLOs, and error budgets for critical CloudBlue services to ensure reliability and performance
Influence system architecture with a strong focus on reliability, scalability, and operability, designing systems for fault tolerance, graceful degradation, and self-healing
Reduce operational toil by identifying opportunities for automation and process improvement
Design and operate CloudBlue’s observability stack across metrics, logs, and traces using tools such as Datadog, Grafana, and Elastic Stack
Develop actionable alerting strategies and dashboards that provide clear insight into platform and business health
Design and maintain high-availability architectures, implementing redundancy, failover, and disaster recovery strategies across regions and availability zones
Conduct capacity planning, load testing, and performance optimization to ensure platform stability and scalability
Act as a senior responder during production incidents, leading incident coordination, communication, and service restoration
Own blameless postmortems and drive improvements that reduce incident frequency, MTTR, and customer impact
Improve reliability of Kubernetes-based platforms through health checks, autoscaling strategies, rollout safety, and resilience testing
Partner with engineering and DevOps teams to improve deployment safety, rollback strategies, and platform reliability
Maintain runbooks and operational documentation, and promote SRE best practices across engineering teams
Support other tasks or projects as assigned to meet team and business needs
Requirements
3+ years of experience as an SRE, DevOps Engineer, or Production Engineer, with strong ownership of production systems
Back-End / DevOps Software Developer focusing on building innovative digital products. Responsible for backend services and managing the DevOps ecosystem to ensure high-quality infrastructure performance.
Lead DevOps Engineer developing key features for the CI/CD pipeline and enhancing developer productivity at RBC. Collaborating on integration strategies and maintaining CI/CD practices.
Observability / DevOps Advisor role overseeing reliability and performance of applications. Supporting teams by implementing observability platforms, focusing on CI/CD pipelines and AI.
Site Reliability Engineer at Chess.com ensuring infrastructure stability and scalable systems for millions of users. Playing a critical role in supporting rapid feature development and deployment.
Junior Release Engineer for a remote gaming company, managing builds and coordinating releases. Focusing on mobile game production and quality assurance tasks in a timeline-driven environment.
DevOps Specialist optimizing infrastructure and deployment cycles for Robotiq's innovative automation solutions. Collaborating with development teams to enhance software delivery and security.