Senior Site Reliability Engineer (with Go Experience)
Chicago, Illinois, United States - Remote
Jackbox Games is looking for an exceptional Site Reliability Engineer (with Go Experience) to join our remote team.
Who are we?
Jackbox is a small game studio (~80 people) best known for our Jackbox Party Pack franchise: a set of five social party games released every fall. Since 2014, our games, like Quiplash, Drawful, and Trivia Murder Party, have been featured on The Tonight Show with Jimmy Fallon, by Polygon, and in living rooms and finished basements across the world.
In 2020, we had over 200 million users. And we have hard evidence that one of those users was Academy Award-winner Charlize Theron.
You can learn everything you ever wanted to know about how our games work (spoiler: your phone is the controller!) and who we are and what we make at jackboxgames.com.
What's the job?
As a Site Reliability Engineer, you will be instrumental in maintaining our AWS-based infrastructure’s high availability, performance, and scalability. In this role, 70% of your time will focus on SRE responsibilities, with a strong emphasis on managing containerized applications via ECS, monitoring, incident response, and ensuring release reliability through automated testing. The remaining 30% will involve building and maintaining applications in Go to support users and game functionality. You’ll also mentor peers, helping to upskill the team in SRE practices.
Key Responsibilities
- SRE Operations (70%)
  - Reliability & Availability: Architect and manage high-availability, resilient systems on AWS ECS to support user experiences and game performance in line with service-level objectives.
  - Infrastructure as Code & Automation: Use ECS for container orchestration and Terraform to automate infrastructure provisioning, ensuring repeatability and scalability.
  - Monitoring & Incident Response: Improve observability using tools like CloudWatch, Prometheus, and Grafana; lead incident response and root cause analysis to improve system reliability.
  - Testing Automation for Release Reliability: Develop and maintain automated testing frameworks that integrate with deployment pipelines to ensure reliable releases, minimizing deployment risks and improving system stability.
  - Performance Optimization: Continuously assess and optimize system performance, ensuring efficiency, cost-effectiveness, and minimal latency.
  - Team Development: Mentor team members in SRE best practices, helping to build a more resilient and skilled team.
- Application Development (30%)
  - Support Applications & Tools: Develop backend applications in Go to enhance user experience and support core game functionality.
  - Automation & Tooling: Build tools to streamline SRE workflows, automate operational tasks, and support infrastructure operations.
- …
Job Profile
Benefits: 401k with matching, Flexible PTO, Flexible Work Schedule, Medical plans, Remote work
Tasks
- Maintain AWS infrastructure
- Manage containerized applications
- Mentor team members
Skills: Automation, AWS, Bash, CI/CD, CloudWatch, Code reviews, DevOps, ECS, Go, Grafana, Mentorship, Prometheus, Python, Scripting, Terraform
Experience: 5 years
Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9