Own operational reliability of cloud load balancing infrastructure serving global customers. Design and implement frameworks reflecting customer experience for reliability management.
Responsibilities
Owning the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
Designing and implementing SLO/SLI frameworks that reflect customer experience for load balancing services, and driving action when error budgets are at risk
Building and maintaining observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage
Leading technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through
Developing and automating safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts
Reviewing design documents and product-requirement documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability
Requirements
8+ years of experience in SRE, infrastructure engineering, or platform engineering, working with large-scale distributed systems
Deep expertise in Linux networking fundamentals and in diagnosing issues at the packet level using tcpdump, netstat, and similar tools
Hands-on experience with L4/L7 load balancing technologies, covering configuration, health checking, high availability, and failure modes at scale
A track record of defining SLO/SLI frameworks, building observability platforms from scratch, and running incident management processes at scale
Expertise in Kubernetes and containerization at scale, including workload scheduling, networking, resource management, and operating stateful or network-intensive workloads in a cluster environment
Experience building automation and tooling in Python or Go, along with infrastructure-as-code experience (SaltStack, Ansible, or Terraform) and strong deployment-safety instincts
Benefits
healthcare
RRSP
company holidays
vacation (in the form of PTO)
sick time
family-friendly benefits, including an employee assistance program with a focus on mental and financial wellness