Senior Site Reliability Engineer, SRE

Posted 3 hours ago

About the role

  • As a Senior Site Reliability Engineer at Fable, you will ensure reliable and scalable infrastructure for AI-driven accessible products, collaborating across teams to improve operational excellence and platform engineering.

Responsibilities

  • Design, build, and maintain reliable, scalable, and secure infrastructure for Fable’s product services
  • Improve system observability, monitoring, and alerting to ensure high availability and fast incident response
  • Contribute to and evolve SRE practices, including SLIs/SLOs, incident management, and postmortems
  • Support and improve CI/CD pipelines and deployment processes
  • Identify and reduce operational complexity across systems and tooling
  • Work across infrastructure and application layers to diagnose and resolve reliability and performance issues, including making targeted improvements to application code when needed
  • Support infrastructure and platform capabilities required for AI/ML-powered features, including scaling, performance, and reliability considerations
  • Monitor and optimize infrastructure costs across cloud environments
  • Contribute to capacity planning and cost forecasting for infrastructure and services
  • Identify opportunities to improve performance and efficiency at the system level
  • Evaluate and optimize the cost and performance of compute-intensive workloads (e.g., AI/ML services), ensuring efficient resource usage and scalability
  • Work with third-party vendors and tools that support Fable’s infrastructure and operations
  • Help evaluate, select, and manage tools and services to support platform reliability and scalability
  • Support vendor-related troubleshooting and ongoing service improvements
  • Partner with Engineering teams to improve reliability, performance, and operational readiness of new features
  • Partner with application engineering teams to improve service architecture, performance, and observability, and help define best practices for building reliable, scalable systems
  • Act as a point of support and escalation for production issues
  • Collaborate across teams to manage dependencies and ensure smooth system operations
  • Contribute to building strong SRE and operational practices across the organization
  • Share knowledge through documentation, pairing, and technical discussions
  • Help onboard and support more junior team members as the team grows
  • Contribute to improving ways of working within the team and across Engineering

Requirements

  • 5–8+ years of experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or Platform Engineering
  • Strong experience with cloud infrastructure (AWS, GCP, or Azure)
  • Experience building internal platforms, tooling, or shared services that improve developer productivity and system reliability
  • Experience designing systems that bridge infrastructure and application layers
  • Ability to work across the stack: comfortable reading, debugging, and making changes to application code (e.g., backend services, APIs) when needed to improve reliability, performance, or observability
  • Experience with at least one backend programming language (e.g., Node.js, Python, Go, Java)
  • Strong experience with monitoring, observability, and alerting tools (e.g., Datadog, Prometheus, Grafana)
  • Solid understanding of CI/CD systems and modern deployment practices
  • Experience managing infrastructure as code (e.g., Terraform, CloudFormation)
  • Experience optimizing system performance and infrastructure costs
  • Familiarity with security and compliance considerations in cloud environments
  • Experience working with third-party vendors and infrastructure tools
  • Familiarity with infrastructure considerations for AI/ML workloads (e.g., high-compute services, data pipelines, or third-party AI platforms) is a strong asset
  • Curiosity about emerging technologies and their impact on infrastructure, reliability, and cost at scale
  • Strong problem-solving skills and ability to navigate complex systems
  • Excellent collaboration and communication skills

Benefits

  • Stock options
  • Career growth opportunities
  • Professional development support
  • Health and dental coverage

Job type

Full Time

Experience level

Senior

Salary

CA$130,000 - CA$150,000 per year

Degree requirement

No Education Requirement

Tech skills

AWS, Azure, Cloud, Google Cloud Platform, Grafana, Java, JavaScript, Node.js, Prometheus, Python, Terraform, Go

Location requirements

Remote, Canada
