DevOps Engineer at Knotch focusing on building and scaling infrastructure for AI-driven marketing technology. Collaborate across teams to enhance reliability and performance.
Responsibilities
Design, build, and maintain scalable, secure, and highly available infrastructure across pre-production and production environments
Develop and manage CI/CD pipelines to enable fast, reliable, and repeatable deployments across multiple environments
Own infrastructure as code (IaC) practices using tools like Terraform to ensure consistency and reproducibility
Manage environment lifecycle (development, staging, production), including promotion workflows and configuration management
Partner closely with Engineering, Data, and AI teams to support system performance, reliability, and scalability
Implement and maintain monitoring, logging, and alerting systems to ensure high visibility into system health and performance
Optimize infrastructure for cost, performance, and reliability, especially for compute- and data-intensive AI workloads
Support Kubernetes-based deployments and container orchestration for distributed systems
Contribute to security best practices across infrastructure, including IAM, networking, and application-level protections
Create dashboards and reporting systems to provide visibility into system performance, uptime, and operational metrics
Document architecture, operational processes, and infrastructure decisions to support knowledge sharing and onboarding
Act as a DevOps/SRE partner across teams, helping troubleshoot issues and improve system reliability
Requirements
5+ years of experience in DevOps, Site Reliability Engineering, or Infrastructure Engineering roles within SaaS, PaaS, or cloud-native environments
Prior experience in a growth-stage and/or startup environment, scaling from $10M to $20M+ ARR with a lean team
Strong experience with Google Cloud Platform (GCP), including IAM, networking, and data services
Hands-on experience with Infrastructure as Code tools such as Terraform
Experience building and maintaining CI/CD pipelines (GitHub Actions, ArgoCD, or similar)
Solid experience with Kubernetes, Docker, and containerized environments
Familiarity with deployment tools such as Helm
Experience with monitoring and observability tools like Prometheus and Grafana
Strong understanding of system reliability, scalability, and performance optimization
Ability to work across multiple systems and priorities in a dynamic environment
Strong documentation and communication skills, with attention to clarity and detail
Experience supporting AI/ML or data-intensive workloads in production environments (Nice-to-Have)
Familiarity with workflow orchestration or data pipeline tools (Nice-to-Have)
Experience with cost optimization strategies for cloud infrastructure (Nice-to-Have)
Exposure to security frameworks and compliance best practices (Nice-to-Have)
Experience working with distributed or globally deployed systems (Nice-to-Have)
Benefits
Comprehensive medical, dental, and vision insurance eligibility