Senior Site Reliability Engineer - Databases (Remote, USA)

United States (Remote)

Grafana Labs Remote-first

USD 148K+ Full Time Senior

Published 1 month ago

Senior Site Reliability Engineer - Databases

This is a remote position and we're considering candidates in the USA & Canada.

About the role:

We are looking for a Senior SRE to help us support our highest-value Grafana Cloud customers by increasing the reliability of our Cloud databases, which are based on Mimir, Loki, Tempo, and Pyroscope. We provide these databases as a SaaS product from AWS, GCP, and Azure across all regions.

The SRE team is a new team within the Databases department. It owns the environments (customer and product cells) for our largest customers and acts as an overlay to the existing teams that run the databases within the system. As an SRE on the team, you will own the configuration of the software via Helm charts and Jsonnet, take part in the PRR for new features, and shepherd releases to the environment, ensuring that new releases do not degrade the SLOs or the user experience for the customer. That means learning what is special about each of these customers and mitigating the risks that a change in the software might produce. You will also contribute design docs, code, PR reviews, and other engineering work directly to the databases to further improve reliability for the customer and observability of the customer stack, and you will make recommendations to customers about their use of the system to further improve reliability.

Like all SRE roles, this one has an on-call element. Unlike other roles, it is a shared pager: “if the Mimir team are paged for this customer, then we are also paged”. This lets you focus your response on the experience the customer has, while being supported by another on-call engineer who focuses on the system. As a company, we hire globally (remote-only) to keep our on-call as healthy as possible, aligned to 12 daylight hours per day by default.

What we seek:

Strong engineering background (at least 6 years) that leans towards SRE roles (at least 3 years)

This may encompass but is not limited to experience as a reliability/production engineer, infrastructure/systems engineer, or software engineer with an infrastructure/systems focus.

Good communication skills: capable of engaging in deep technical conversations with other engineers and customers, and of collaborating across organizational boundaries.
Experience with Kubernetes on any of AWS, GCP, or Azure, and working with Helm charts or other IaC tools.