Lead Site Reliability Engineer

Remote - US

LaunchDarkly

USD 156K+, Full Time, Senior

Published 1 month ago

About the Job:

Software powers the world, and LaunchDarkly empowers all teams to deliver and control the best software. We serve trillions of feature flags daily to help teams ship better software faster and eliminate risk for companies big and small.

We're based in downtown Oakland and growing quickly. You'll help us tackle some of the most challenging engineering problems around, like delivering feature flags to hundreds of millions of users worldwide in milliseconds.

In this role, you'll oversee the health of our core systems and reliability tooling, respond to and mitigate incidents quickly, and identify and drive opportunities that make our core services more resilient. You will also identify and develop force-multiplying capabilities for our internal engineering teams, helping our engineers become more effective at shipping robust code and thinking about reliable design earlier in the lifecycle.

The core technologies we use daily include AWS, Go, CockroachDB, Elasticsearch, Redis, Flink, Kinesis, and Terraform.

Responsibilities:

Lead the development and continuous refinement of SRE tools and processes to improve software delivery, observability, reliability, and operational efficiency. Your impact will extend beyond your own team to proactively improve our overall service health.
Enable our engineering teams to deliver their services with greater autonomy, reliability, and performance through tooling written in Go and Terraform, or delivered through existing tools.
Define and standardize service health and reliability metrics that align with business goals, and ensure these metrics are transparent and actionable.
Help improve the effectiveness of our incident management lifecycle and drive initiatives to train key roles involved in incident response and our post-incident review process.
Partner with various team members to define and mature our SRE culture through principles, technical frameworks, tooling, and processes. You will mentor and coach SRE team members and engineers in adjacent teams to promote a culture of SRE learning and growth.
Drive the adoption of new technologies, system designs, and best practices in code health, testing, observability, and service maintainability across teams.
Proactively identify and resolve potential performance and scalability bottlenecks in our front-end and back-end systems and underlying infrastructure.
Analyze the performance of SQL queries, suggest improvements, and build guardrails for teams.