Member of Technical Staff in AI working on GPU infrastructure design and optimization. Collaborating with researchers and engineers to support AI workloads and enhance system reliability.
Responsibilities
Collaborate with researchers and engineers to understand workload requirements and translate them into infrastructure decisions.
Design and improve scheduling and resource allocation so that inference and training workloads can coexist on shared GPU clusters.
Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs.
Own GPU utilization and cost as first-class metrics.
Build automated tooling and observability that reduces friction for the AI team.
Participate in on-call rotation and drive reliability improvements.
Serve as the primary point of contact for GPU providers, managing relationships and coordinating infrastructure needs.
Over time, take on broader ownership: setting scheduling policy, driving architecture decisions for compute and storage systems, and identifying when infrastructure no longer fits evolving workloads.
Requirements
Deep Systems Foundation: Linux-native, understands how machines work, and can debug at the kernel level.
Deep understanding of networking and storage stacks.
Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs), Kubernetes, Slurm, and distributed storage systems.
Experience designing, building, and operating distributed systems at scale.
Track record of running critical infrastructure reliably — monitoring, incident response, and automation that reduces toil.
Enough understanding of training and inference workloads to collaborate with researchers and make sound infrastructure decisions.
Bonus: Experience in environments where allocation, scheduling, and prioritization of scarce resources were the core problem (e.g., HPC, trading, large-scale ML platforms).
Creatio Developer customizing and delivering CRM solutions for Bits In Glass. Collaborating with analysts and architects in an Agile environment to tackle business challenges.
Urgent full-time roles in Toronto: QE with Contact Center, Business System Analyst, Platform Lead, Integration Engineer, PeopleSoft Functional Analyst, APM Data Stewards, AWS Infra Architect.
Senior IT Project Manager for Government of Alberta in Edmonton. Requires public sector experience, PMP/CSM certifications, and expertise in Agile, complex IT systems, and stakeholder management.
Project Manager for ABS Group overseeing construction projects from planning to billing. Ensuring quality control and client communication throughout project execution.
Materials Engineering Project Manager handling the planning and quality of the work. Responsible for project success and for supervising job sites as part of a dynamic team.
Senior Commodity Engineer managing electronic component quality for Celestica. Collaborating with suppliers to drive quality improvements and maintain standards.
Senior Mobile Developer designing and evolving mobile applications for Sherpa, a leader in complex event apps. Collaborating with teams to ensure application stability and functionality.
Mobile Developer designing and developing mobile applications for extensive user bases at Sherpa. Collaborating across teams and utilizing various technologies in a dynamic environment.
Director of Engineering overseeing end-to-end engineering delivery for Group Benefits Claims Technology. Responsible for high availability and performance across multiple teams.