About the role

As a Site Reliability Engineer at Yelp, you will manage scalable, self-healing systems, collaborate with global teams, and ensure platform reliability for thousands of users.

Responsibilities

Bring your curiosity, tenacity and experience
Work with engineers across Yelp to support new features and services
Integrate tools to monitor platform stability and performance
Help scale our Kubernetes clusters and AWS-based infrastructure while maintaining our platform's SLOs
Ensure the reliability of Yelp’s primary datastores (MySQL and Cassandra)
Troubleshoot site issues using industry-leading tools like Splunk, Grafana, and Prometheus
Automate everything with Python, Puppet, Git, Jenkins, Terraform and more!
Develop custom tools when off-the-shelf solutions don’t work at our scale, and contribute upstream to open source projects
Design and implement new systems, tests, and procedures
Participate in light on-call rotations

Requirements

Mastery of Linux (we use Ubuntu, but any distro is fine)
Command of your favorite modern programming language (Python, TypeScript, Ruby, Go, Rust, Java, C++, etc.) and an appreciation for delivering safe and secure services.
A solid understanding of the fundamental technologies for delivering services on the Internet (TCP/IP, HTTP, DNS, etc.).
Experience with public cloud platforms (we use AWS and GCP, but others are also fine) and related tooling (Terraform, Puppet, Chef, Ansible, etc.).
Experience with Linux containerisation and orchestration (e.g., Docker, Podman and Kubernetes).
Self-motivation to investigate, fix and improve Yelp in an ever-changing environment.
Experience leading, collaborating on and sharing technical activities with global teams.
Ownership of the total lifecycle of a system.