Member of Technical Staff – Infrastructure, World Models

Posted yesterday

About the role

  • As a Member of Technical Staff on the AI team, you will design and optimize GPU infrastructure, collaborating with researchers and engineers to support AI workloads and improve system reliability.

Responsibilities

  • Collaborate with researchers and engineers to understand workload requirements and translate them into infrastructure decisions.
  • Design and improve scheduling and resource allocation for inference and training coexistence on shared GPU clusters.
  • Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs.
  • Own GPU utilization and cost as first-class metrics.
  • Build automated tooling and observability that reduces friction for the AI team.
  • Participate in on-call rotation and drive reliability improvements.
  • Serve as the primary point of contact for GPU providers, managing relationships and coordinating infrastructure needs.
  • Over time, take on broader ownership: setting scheduling policy, driving architecture decisions for compute and storage systems, and identifying when infrastructure no longer fits evolving workloads.

Requirements

  • Deep systems foundation: Linux-native, understands how machines work, and can debug at the kernel level.
  • Deep understanding of networking and storage stacks.
  • Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs), Kubernetes, Slurm, and distributed storage systems.
  • Experience designing, building, and operating distributed systems at scale.
  • Track record of running critical infrastructure reliably — monitoring, incident response, and automation that reduces toil.
  • Enough understanding of training and inference workloads to collaborate with researchers and make sound infrastructure decisions.
  • Bonus: Experience in environments where allocation, scheduling, and prioritization of scarce resources were the core problem (e.g. HPC, trading, large-scale ML platforms).

Benefits

  • Competitive salary and equity
  • Private health coverage
  • Pension contribution (UK, Canada, US)
  • Unlimited paid vacation
  • Fully-distributed, async-first culture
  • Hardware setup of your choice
  • Stipends for phone, internet, and meals

Job type

Full Time

Experience level

Lead

Salary

Not specified

Degree requirement

Bachelor's Degree

Tech skills

Distributed Systems, Kubernetes, Linux

Location requirements

Remote, Canada
