Site Reliability Engineer focusing on monitoring, observability, and alerting at CMG, a fintech transforming equity capital markets.
Responsibilities
Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
Define and implement SLOs, SLIs, and error budgets to measure system reliability.
Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
Design actionable alerting strategies to minimize noise and improve MTTR.
Integrate alerting systems with Jira.
Establish and refine runbooks for on-call teams to handle alerts efficiently.
Empower teams to maintain observability coverage and sound incident response practices.
Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
Identify opportunities for automation and develop tools to streamline operational processes, such as failover, configuration management, and monitoring.
Build monitoring and alerting into automations to detect and resolve issues proactively.
Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
Communicate effectively to stakeholders about system changes, incidents, and improvements.
Promote SRE principles and practices across the company.
Requirements
Must be based in Latin America
English level: C1 or C2
Proven experience as a Site Reliability Engineer or in a similar role.
Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
Strong programming and scripting skills (Python, Bash).
Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
Understanding of Linux-based systems, networking, and security principles related to containerized applications.
Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
Excellent communication and collaboration abilities.
Ability to thrive in a fast-paced, constantly evolving environment.
Experience with PostgreSQL monitoring and optimization (nice to have).
Benefits
Equity
Unlimited PTO (15 days + bank holidays + unlimited additional paid leave)
Comprehensive benefits program managed by Globalization Partners