Senior Site Reliability Engineer

Remote

Toast

Published 4 months ago


Toast is driven by building the restaurant platform that helps restaurants adapt, take control, and get back to what they do best: building the businesses they love.

At Toast, our Site Reliability Engineers (SREs) are responsible for enabling our engineering teams to ensure customer-facing services and other Toast production systems are running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound software engineering principles, operational discipline, and mature automation to our environments and our codebase.

About this roll* (Responsibilities)

Implement and evolve a world-class observability technology stack that allows rapid detection of issues in our system and enables root cause analysis (25%)

Provide scalable metrics and dashboarding solutions for R&D
Provide distributed tracing capabilities to visualize and track issues across our complex system
Provide log aggregation and insights for R&D using best-in-class technology
Provide a global view of the true customer experience through usage of Real-User Monitoring & external cloud-based solutions

Act as a champion for reliability and work with partner teams in different lines of business to improve resiliency and reliability of all services. Champion our uptime targets and enable other teams to improve the way we measure the reliability of the system (25%)
Facilitate and drive production triage, incident resolution, and retrospective/root cause analysis to maintain the reliability and uptime of our platform (20%)

Leverage a strong understanding of Cloud Architecture
Draw on experience developing and operating software on the JVM (Java Virtual Machine) to triage and understand issues within services
Diagnose performance bottlenecks and implement optimizations across infrastructure, database, web, and mobile applications
Implement strategies to increase system reliability and performance through on-call rotation and process optimization
Lead incident post-mortem/retrospectives to surface reliability improvements and drive to completion

Support and enable the adoption of a platform for service resilience testing/chaos engineering to validate that Toast’s architecture is resilient to failure. Build and own a performance testing framework/environment so our R&D teams can understand the constraints of their services and improve performance (15%)

Do you have the right ingredients*? (Requirements)

Extensive and broad industry experience, with 3-7 years building and running production systems and participating in incident calls
Deep understanding of cloud and microservice architecture, and the JVM
Comfortable …


Job Profile

Restrictions

Remote

Benefits/Perks

Benefits programs, Competitive compensation, Competitive compensation and benefits, Competitive compensation and benefits programs, Equity, Flexibility, Flexible benefits, Healthy lifestyle, Total Rewards package, Total rewards package goes beyond great earnings potential

Tasks

Optimize system performance

Skills

Automation, AWS, Cloud Architecture, Customer Experience, Datadog, Design, Distributed Systems, Engineering, Incident Management, Java, JVM, Microservices, New Relic, Performance Testing, R, Root Cause Analysis, Site Reliability Engineering, Splunk, Technology

Experience

3-7 years

Education

Business, Engineering
