Site Reliability Engineer - New York or Remote

About Knock

Knock is on a mission to help products communicate with their users in a more thoughtful way. Building product notifications in-house takes months, often leading to poor user experiences. We believe that—when done right—product notifications help users find value in the products they use every day. That’s why we built Knock.

We're a remote-first (with a NYC base) seed-stage startup of 20 employees who believe in the power of great software. We're APIs all the way down at Knock: Stripe for payments, Algolia for search, WorkOS for SSO. We're excited to add Knock to that list and to push forward the API-first movement. If you are, too, come join us and let's build something great together.

We’re backed by top investors and operators including Craft Ventures, Afore Capital, Preface Ventures, Worklife Capital, Guillermo Rauch (CEO/Founder @ Vercel), Scott Belsky (CPO @ Adobe), Adam Gross (CEO @ Heroku), John Kodumal (CTO @ LaunchDarkly), Nate Stewart (CPO @ Cockroach Labs), Charley Ma, and Zach Holman, to name a few.

About the role

We're looking for an SRE to join our small but growing platform team. Platform engineering at Knock is the foundation for everything else we do. Because Knock is built by and for engineers, there is a very blurry line between “platform” and “product.” The product is the platform.

Because the product is the platform, engineers at Knock orient around key business- and customer-facing metrics, and work from there to achieve greater scale, resilience, and performance for our customers & partners.

You will have a high degree of ownership and autonomy in improving the Knock platform, starting with our foundational infrastructure. We’re an engineer-led team. We value shipping high-quality product at a fast pace.

We care deeply about building a team and culture that is inclusive and equitable for people of all backgrounds and experiences, and believe firmly that the best teams are diverse. We particularly encourage people from underrepresented communities to apply.

Last thing: you can be a great fit even if you don't perfectly match what's described below. We know there's a lot we don't know and haven't thought of yet, and we're looking for teammates who can tell us what those things are. If that's you, don't hesitate to apply and tell us about yourself!

What you’ll be doing in this role

We're an early-stage company, so everyone (including you) is involved in building every part of the company, from the product to how we get work done internally. Here is a collection of hats we need you to be OK with wearing:

  1. Adopting a Terraform-backed EKS cluster, modernizing & maintaining it for elastic scale, reliability, performance, security, etc.

  2. Going deep into troubleshooting Postgres performance and queues of every shape and size, and coming out the other side with a plan for scaling another 10x to 100x. You know what Little’s Law is and you know how to use it.

  3. Identifying and correcting scaling issues before they affect our customers by relying on and improving our telemetry in Datadog & AWS CloudWatch. If you see a blind spot, you are comfortable getting into the codebase to fix it.

  4. Maintaining and improving upon our >99.95% uptime track record.

  5. Exploring how we ship customer value and how we can improve that process through canaries, improved cycle time, blue/green deploys, etc.

  6. Taking all of this and replicating it in multiple AWS regions.

  7. Joining on-call rotations on a schedule with the rest of the engineering team.

  8. Shaping culture and practices for future hires in DevOps and across the company. Knock places deep trust in each team member to self-organize and do what is best for our customers and each other.

This position is both high autonomy and high accountability: you will have a lot of room to work and raise our existing standards, while also communicating those changes and bringing the rest of the team along for the ride, often in the form of runbooks & internal documentation.

What we’re looking for in this role

  • 3+ years of experience working in and on production Kubernetes clusters using infrastructure as code (ideally Terraform, but others are fine too).

  • 3+ years of experience working on complex AWS deployments (multi-account, complex VPC structure to support EKS, EKS experience).

  • 3+ years of production experience running Postgres at scale (especially AWS RDS or Aurora), including analyzing query plans, managing replication slots, table partitioning, and indexing strategies.

  • 3+ years of experience supporting production queueing systems like Kafka, Kinesis, RabbitMQ, SQS, etc.

  • You care deeply about building elegant systems that are delightful to interact with on every level, especially when it comes to API performance & latency.

  • You like the idea of joining an early-stage team where you can play a meaningful part in shaping the direction of the company, product, and culture.

  • You might have some prior experience writing or deploying Elixir, but this is a nice-to-have. You will have opportunities to develop your Elixir skills and provide application-level improvements as part of this role.


Job Profile


North America


United States


AWS Cloudwatch Datadog Postgres Terraform
