About the role

As a Senior Site Reliability Engineer, you will ensure the reliability and performance of Vantage’s services while collaborating across teams, engaging in incident response, and driving infrastructure improvements.

Responsibilities

Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
Take ownership of the availability, scalability, and performance of our services, proactively identifying issues and implementing automation to prevent their recurrence.
Participate in the on-call rotation, responding to incidents and working with the team to restore service and prevent recurrence.
Contribute to automating infrastructure provisioning, configuration, and management using IaC principles with tools like Terragrunt and Ansible.
Help design and enhance monitoring, logging, and alerting systems to improve observability and ensure system health.
Participate in blameless post-mortems, documenting issues and following up on action items to foster a culture of learning and continuous improvement.
Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.

Requirements

6+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role working with software and infrastructure.
Proficiency with either Python or Bash.
Hands-on experience with Azure or AWS.
Familiarity with CI/CD pipelines and infrastructure as code (IaC) tooling such as Terraform and Ansible.
Demonstrated ability to triage and prioritize effectively when troubleshooting incidents.
History of engaging effectively with cross-functional teams during events such as incident response and post-mortems.
Track record of proactively tailoring infrastructure to meet the unique needs of the product it supports.