About the role

As a Member of Technical Staff in AI, you will work on GPU infrastructure design and optimization, collaborating with researchers and engineers to support AI workloads and improve system reliability.

Responsibilities

Collaborate with researchers and engineers to understand workload requirements and translate them into infrastructure decisions.
Design and improve scheduling and resource allocation so that inference and training workloads can coexist on shared GPU clusters.
Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs.
Own GPU utilization and cost as first-class metrics.
Build automated tooling and observability that reduce friction for the AI team.
Participate in the on-call rotation and drive reliability improvements.
Serve as the primary point of contact for GPU providers, managing relationships and coordinating infrastructure needs.
Over time, take on broader ownership: setting scheduling policy, driving architecture decisions for compute and storage systems, and identifying when infrastructure no longer fits evolving workloads.

Qualifications

Deep systems foundation: Linux-native, understands how machines work, and can debug at the kernel level.
Deep understanding of networking and storage stacks.
Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs), Kubernetes, Slurm, and distributed storage systems.
Experience designing, building, and operating distributed systems at scale.
Track record of running critical infrastructure reliably: monitoring, incident response, and automation that reduces toil.
Enough understanding of training and inference workloads to collaborate with researchers and make sound infrastructure decisions.
Bonus: Experience in environments where allocation, scheduling, and prioritization of scarce resources were the core problem (e.g., HPC, trading, large-scale ML platforms).