About the role

  • Site Reliability Engineer focusing on monitoring, observability, and alerting at CMG, a fintech transforming equity capital markets.

Responsibilities

  • Design, implement, and maintain monitoring and observability solutions using tools like Prometheus, the Grafana stack (Loki, Grafana, Tempo, Alertmanager), Datadog, and OpenTelemetry.
  • Define and implement SLOs, SLIs, and error budgets to measure system reliability.
  • Develop and optimize dashboards, alerts, and reports for system performance and business metrics.
  • Design actionable alerting strategies to minimize noise and improve MTTR.
  • Integrate alerting systems with Jira.
  • Establish and refine runbooks for on-call teams to handle alerts efficiently.
  • Empower teams to maintain observability coverage and follow sound incident response practices.
  • Analyze system performance metrics, identify bottlenecks, and implement optimizations to improve system efficiency, scalability, and cost-effectiveness.
  • Help conduct load testing and capacity planning to ensure systems can handle peak traffic loads.
  • Identify opportunities for automation and develop tools to streamline operational processes, such as fail-over, configuration management, and monitoring.
  • Implement monitoring and alerting systems within automations to detect and resolve issues proactively.
  • Collaborate closely with cross-functional teams, including software engineers, operations, and infrastructure teams, to understand system requirements, provide technical guidance, and drive solutions.
  • Communicate effectively to stakeholders about system changes, incidents, and improvements.
  • Promote and spread SRE principles and practices across the company.

Requirements

  • Must be based in Latin America
  • English level: C1 or C2
  • Proven experience as a Site Reliability Engineer or similar role.
  • Proficiency in logging, metrics, and tracing frameworks (Datadog, Loki, Prometheus, OpenTelemetry).
  • Experience with cloud platforms (Azure preferred) and infrastructure-as-code tools (e.g., Terraform).
  • Strong programming and scripting skills (Python, Bash).
  • Proficiency in containerization technologies and orchestration tools (Docker, Kubernetes).
  • Understanding of Linux-based systems, networking, and security principles related to containerized applications.
  • Strong problem-solving and troubleshooting skills, with a passion for identifying and resolving complex technical issues.
  • Excellent communication and collaboration abilities.
  • Ability to thrive in a fast-paced, constantly evolving environment.
  • Experience with PostgreSQL monitoring and optimization (Optional/Nice to have).

Benefits

  • Equity
  • Unlimited PTO (15 days + bank holidays + unlimited additional paid leave)
  • Comprehensive benefits program managed by Globalization Partners
  • Premium life and income protection
  • Top private medical and dental insurance
  • Employee Assistance Program (EAP)
  • Pension contributions
  • Remote work environment
  • Education reimbursement
  • Continuous learning opportunities
  • Employee referral bonus
  • Parental leave

Job type

Full Time

Experience level

Mid level, Senior

Salary

Not specified

Degree requirement

No Education Requirement

Tech skills

Azure, Cloud, Docker, Grafana, Kubernetes, Linux, Postgres, Prometheus, Python, Terraform

Location requirements

Remote, Canada
