Senior Site Reliability Engineer at Fable, ensuring reliable and scalable infrastructure for AI-driven accessible products and collaborating across teams to improve operational excellence and platform engineering.
Responsibilities
Design, build, and maintain reliable, scalable, and secure infrastructure for Fable’s product services
Improve system observability, monitoring, and alerting to ensure high availability and fast incident response
Contribute to and evolve SRE practices, including SLIs/SLOs, incident management, and postmortems
Support and improve CI/CD pipelines and deployment processes
Identify and reduce operational complexity across systems and tooling
Work across infrastructure and application layers to diagnose and resolve reliability and performance issues, including making targeted improvements to application code when needed
Support infrastructure and platform capabilities required for AI/ML-powered features, including scaling, performance, and reliability considerations
Monitor and optimize infrastructure costs across cloud environments
Contribute to capacity planning and cost forecasting for infrastructure and services
Identify opportunities to improve performance and efficiency at the system level
Evaluate and optimize the cost and performance of compute-intensive workloads (e.g., AI/ML services), ensuring efficient resource usage and scalability
Work with third-party vendors and tools that support Fable’s infrastructure and operations
Help evaluate, select, and manage tools and services to support platform reliability and scalability
Support vendor-related troubleshooting and ongoing service improvements
Partner with Engineering teams to improve reliability, performance, and operational readiness of new features
Partner with application engineering teams to improve service architecture, performance, and observability, and help define best practices for building reliable, scalable systems
Act as a point of support and escalation for production issues
Collaborate across teams to manage dependencies and ensure smooth system operations
Contribute to building strong SRE and operational practices across the organization
Share knowledge through documentation, pairing, and technical discussions
Help onboard and support more junior team members as the team grows
Contribute to improving ways of working within the team and across Engineering
Requirements
5–8+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Platform Engineering
Strong experience with cloud infrastructure (AWS, GCP, or Azure)
Experience building internal platforms, tooling, or shared services that improve developer productivity and system reliability
Experience designing systems that bridge infrastructure and application layers
Ability to work across the stack: comfortable reading, debugging, and making changes to application code (e.g., backend services, APIs) when needed to improve reliability, performance, or observability
Experience with at least one backend programming language (e.g., Node.js, Python, Go, Java)
Strong experience with monitoring, observability, and alerting tools (e.g., Datadog, Prometheus, Grafana)
Solid understanding of CI/CD systems and modern deployment practices
Experience managing infrastructure as code (e.g., Terraform, CloudFormation)
Experience optimizing system performance and infrastructure costs
Familiarity with security and compliance considerations in cloud environments
Experience working with third-party vendors and infrastructure tools
Familiarity with infrastructure considerations for AI/ML workloads (e.g., high-compute services, data pipelines, or third-party AI platforms) is a strong asset
Curiosity about emerging technologies and their impact on infrastructure, reliability, and cost at scale
Strong problem-solving skills and ability to navigate complex systems
Senior DevOps Engineer managing Zipline's cloud infrastructure and CI/CD systems. Collaborating with engineering teams to ensure platform reliability and scalability.
Back-End & DevOps Software Developer contributing to building digital products to change the world. Specializing in back-end development and command of the DevOps ecosystem for robust infrastructure.
Storage Technical Analyst providing global support for RBC's storage and backups infrastructure. Mentor operations staff and manage automation solutions for advanced incident management.
Infrastructure Engineer/SRE responsible for core infrastructure design and building tools for AI-driven contact center solutions. Join a leading AI company impacting the future of work.
DevOps Engineer intern at Sun Life focusing on Java applications and working with Docker and Kubernetes. Engage in collaborative, agile practices with the DevOps team.
Senior Developer, DevOps responsible for Azure infrastructure and automation at Radio-Canada. Collaborating with development teams to ensure optimal performance, availability, and security for digital media services.
Senior Analyst on Data Platform DevOps at AIMCo, responsible for building data operations and collaborating with teams on innovative solutions. Focused on ensuring data quality and integrity across technologies.
Site Reliability Engineer ensuring reliability, availability, and performance of Hiive's platform. Collaborating with cross-functional teams to build scalable and resilient infrastructure while supporting AI systems.
AI Security Control Developer/Site Reliability Engineer for RBC's enterprise AI ecosystem. Design, implement, and validate security controls to protect AI systems with 24/7 reliability.
DevOps Engineering Manager leading a team to improve SDLC at Vancity, Canada's largest Living Wage Employer. Collaborating across teams for reliable delivery of mission-critical systems.