Site Reliability Engineer managing scalable, self-healing systems at Yelp. Collaborating with global teams and ensuring platform reliability across thousands of users.
Responsibilities
Bring your curiosity, tenacity and experience
Work with engineers across Yelp to support new features and services
Integrate tools to monitor platform stability and performance
Help scale our Kubernetes clusters and AWS-based infrastructure while maintaining our platform's SLOs
Ensure the reliability of Yelp’s primary datastores (MySQL and Cassandra)
Troubleshoot site issues using industry-leading tools like Splunk, Grafana, and Prometheus
Automate everything with Python, Puppet, Git, Jenkins, Terraform and more!
Develop custom tools when off-the-shelf solutions don’t work at our scale, and contribute upstream to open source projects
Design and implement new systems, tests, and procedures
Participate in light on-call rotations
Requirements
Mastery of Linux (we use Ubuntu but any distro is fine)
Command of your favorite modern programming language (Python, TypeScript, Ruby, Go, Rust, Java, C++, etc.) and an appreciation for delivering safe and secure services.
A solid understanding of the fundamental technologies for delivering services on the Internet (TCP/IP, HTTP, DNS, etc.).
Experience with public cloud platforms (we use AWS and GCP, but others are also fine) and related tooling (Terraform, Puppet, Chef, Ansible, etc.).
Experience with Linux containerisation and orchestration (e.g., Docker, Podman and Kubernetes).
Self-motivation to investigate, fix, and improve Yelp in an ever-changing environment.
Experience leading, collaborating on, and sharing technical activities with global teams.
Principal Site Reliability Engineer responsible for AWS infrastructure and reliability engineering. Collaborating across teams to enhance platform performance and security practices.
Junior/Intermediate DevOps Engineer role in Toronto (Hybrid). Build CI/CD pipelines with GitHub Actions, deploy Java/Spring Boot apps on OpenShift, and collaborate with DevOps teams.
Platform DevOps managing the Enterprise Data and AI Platform across AWS and Kubernetes. Implementing Infrastructure as Code with Terraform and maintaining CI/CD pipelines for secure solutions.
Lead DevOps specialized in AWS/GCP Cloud solutions for FinOps team. Driving cross-functional activation and managing cloud environments, data integrations, and automation strategies.
Skilled DevOps Engineer providing expertise in deployment automation for TD's technology solutions team. Engaging in improving development and release processes while ensuring security and system integrity.
Infrastructure Reliability Engineer supporting critical SaaS services. Collaborate, innovate, and optimize the reliability and performance of cloud systems on AWS and Kubernetes.
DevOps Engineer to help scale cloud and on-prem environments, automating deployments and enhancing security posture for energy-intelligent compute applications.
Reliability Engineering Architect at Carbon60 managing a team to deliver AWS cloud solutions. Focus on mentoring engineers and integrating AI tools into automated systems.