Site Reliability Engineer at Chess.com, ensuring infrastructure stability and scalable systems for millions of users. The role is critical to supporting rapid feature development and deployment.
Responsibilities
Design and implement multi-regional resilient infrastructure capable of handling millions of concurrent sessions and transactions daily across global data centers
Lead the hybrid cloud migration strategy, integrating bare-metal datacenter resources with cloud services for optimal performance and cost efficiency
Own the on-call rotation and incident response procedures, ensuring rapid resolution of critical system issues and maintaining high availability SLAs
Architect monitoring and alerting systems using industry-standard tools to proactively identify and resolve performance bottlenecks before they impact users
Collaborate with development teams to implement infrastructure-as-code practices and establish deployment pipelines that support continuous integration and delivery
Optimize system performance through capacity planning, load testing, and resource allocation across distributed computing environments
Establish and maintain security protocols and risk assessment procedures for infrastructure components and data protection
Partner with engineering teams to design scalable solutions for high-traffic applications and real-time processing requirements
Drive automation initiatives to reduce manual operational overhead and improve system reliability through scripting and configuration management
Mentor team members on SRE best practices and contribute to the development of infrastructure standards and documentation
Requirements
Bachelor's degree in Computer Science, Engineering, or related technical field, or equivalent practical experience
5+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
Strong proficiency with UNIX/Linux operating systems and command-line administration
Experience with cloud platforms (GCP, AWS, or Azure) and infrastructure-as-code tools (Terraform, CloudFormation, or similar)
Hands-on experience with configuration management systems (Ansible, Chef, Puppet, or similar)
Solid understanding of networking fundamentals, protocols (TCP/IP, HTTP/HTTPS, DNS), and network troubleshooting
Experience with containerization and orchestration technologies (Docker, Kubernetes, or similar)
Proficiency with monitoring and observability tools (Datadog, Prometheus, Grafana, ELK stack, or similar)
Experience with relational and NoSQL databases, including performance optimization and scaling strategies
Strong collaboration and communication skills for working effectively in a distributed team environment
Demonstrated sense of ownership and accountability for system reliability and performance
Nice to have
Experience managing bare-metal server infrastructure and datacenter operations
Advanced knowledge of content delivery networks (CDNs) and edge computing
Experience with server-side automation and scripting languages (Python, Go, Bash, or similar)
Background in high-availability architectures and disaster recovery planning
Familiarity with security frameworks and compliance requirements
Experience with game server infrastructure or real-time application hosting
Knowledge of database administration and optimization for high-concurrency applications
Understanding of CI/CD pipelines and deployment automation
Experience with capacity planning and performance testing tools
Previous experience in a fully remote, distributed work environment
Continuous learning mindset with interest in emerging infrastructure technologies