Site Reliability Engineer specializing in Kafka, managing Yelp’s data streaming infrastructure. Collaborating on projects to ensure the reliability and performance of critical services across hybrid and multi-cloud environments.
Responsibilities
Design, deploy, and maintain large-scale Kafka event streaming infrastructure across hybrid and multi-cloud environments
Collaborate with engineers to enable new features, ensure data pipeline reliability, and advise on best practices for real-time data processing
Execute and automate Kafka cluster upgrades, migrations, and major version rollouts with minimal impact to critical services
Build or enhance self-service capabilities and automation for cluster operations, scaling, and incident recovery
Troubleshoot complex issues affecting data flow, performance, or stability, and drive root cause analyses
Participate in on-call rotations
Requirements
Strong hands-on experience designing and implementing large-scale Kafka event streaming capabilities in production, across Linux-based hybrid or multi-cloud environments
In-depth knowledge of event streaming/data-in-motion design principles, architecture, and operational nuances
Programming proficiency in Java, Python, or similar modern languages for tooling, integration, and automation
Familiarity with Kafka Client APIs (Producer, Consumer, Streams), as well as sizing and capacity planning for high-throughput clusters
Experience designing and optimizing real-time data streaming solutions with technologies like Apache Flink
Knowledge of automating infrastructure and operational tasks (configuration management, IaC, scripting, or related)
Problem-solving mindset with an eagerness to learn, take initiative, and advocate for infrastructure best practices in a fast-paced environment
Senior DevOps/MLOps/Data Engineer (Azure) role designing CI/CD pipelines, deploying AI models, and building scalable data platforms. Fully remote contract position.
DevOps Engineer responsible for maintaining FME infrastructure and development pipelines at Safe Software. Collaborate in an agile team focused on constant improvement and automation.
Senior Site Reliability Engineer ensuring scalability and performance of infrastructure at Semios Group. Collaborating with high-performing teams to improve product reliability and automation.
Site Reliability Engineer supporting backend systems in a digital assets holding company. Collaborating on infrastructure projects across various blockchain ecosystems with a focus on DevOps best practices.
Site Reliability Principal Specialist at Sherweb responsible for enhancing reliability across IT Operations platforms and services. Implementing proactive and scalable approaches to site reliability while influencing technical direction.
DevOps Manager responsible for service delivery and cloud & web systems reliability at Cority. Architecting CI/CD environments and mentoring technical team members in DevOps practices.
Sr. DevOps Engineer at Cority working on the deployment and operation of systems. Collaborating to deliver automated cloud infrastructure and continuous delivery processes in a remote, Canada-based role.
DevOps Engineer managing CI/CD and infrastructure improvements for a growing crypto company. Collaborating in a remote team across Canada to enhance operational processes and services.
Site Reliability Engineer enhancing reliability and operational readiness of services at Newton. Collaborating with engineering teams for system design and incident management.