FreshRemote.Work

Site Reliability Engineer, Cloud - Remote - United States

At Yugabyte, we are on a mission to become the default transactional database for the cloud. We are well underway on this journey with YugabyteDB, the open source, high-performance, distributed SQL database that runs on any cloud and enables developers to get instantly productive using well-known APIs. We are looking for talented and driven people to join us on our ambitious mission and help us build a lasting and impactful company.

We announced a $188M Series C round at a $1.3B valuation in October 2021, but we believe we are still in the early stages of our company’s journey. The transactional database market is estimated to grow from $40B in 2021 to $64B by 2025. Because our database is cloud-native by design, has on-demand horizontal scalability, and allows for geographical distribution of data using built-in replication, we are extremely well-positioned to address the market need for geo-distributed, high-scale, high-performance workloads.

The Role

As a Site Reliability Engineer focused on database availability and reliability, you will use your skills to operate and automate the life cycle of the YugabyteDB DBaaS. You will design and build processes that spin up systems and the infrastructure that manages the databases, using secure, reliable, scalable, and highly observable methodologies. You will use, operate, and configure Kubernetes environments (GKE, EKS, AKS), Java frameworks, Shell scripts, Python scripts, Terraform templates, and many other cloud technologies. You will participate in the on-call rotation, covering 12 hours a day for 7 days every 4-5 weeks, and manage incidents on the DBaaS infrastructure, coordinating support for our customers. You will learn how to diagnose problems with our database and infrastructure technology and help deliver reliable service to our customers.

We are looking for strong engineers who exemplify collaboration, teamwork, and empathy, and who like to lead by example. We enjoy working with people who are driven, who thrive in a fast-paced startup environment, and who have a strong desire to build an internet-scale, extensible control plane with a strong emphasis on simplicity and user experience.

Responsibilities

  • Design, develop, test, debug, troubleshoot, and maintain components of the DBaaS cloud infrastructure
  • Manage operational priorities of the DBaaS infrastructure
  • Establish processes for handling incidents on databases or infrastructure and for leading the response to them
  • Automate and manage regular maintenance operations such as upgrades
  • Design and build DBaaS processes for encryption, security key/password management, storage management, etc.
  • Utilize SRE golden signals to analyze and optimize the DBaaS system's performance and reliability strategies

Requirements

  • Strong software design and implementation skills in building infrastructure frameworks
  • Experience building and operating data systems for production applications, including fault-tolerant designs, software lifecycles, and automation of critical operations
  • Strong track record of incident response and management for a managed service that is mission-critical to its customers
  • Experience with:
    • Relational database systems (PostgreSQL preferred)
    • Public cloud infrastructure (AWS, GCP, and/or Azure)
    • Containerization tooling, theory and design (Docker, Kubernetes)
    • Infrastructure as Code (Terraform preferred)
    • Configuration Management Tooling (Ansible preferred)
    • Automation Scripting (Python and Bash preferred)
    • Monitoring systems (Prometheus preferred)
    • Version control systems (Git preferred)
    • CI/CD systems (GitHub Actions preferred)
  • Solid understanding of Linux systems operations and troubleshooting
  • Willingness and ability to learn new languages and concepts

Interview Process: Health and safety remain a top priority for all of our roles. As such, all Yugabyte interviews are held virtually, so we can all continue doing our part with social distancing and containment efforts. Although we are based in Silicon Valley, we hire exceptional folks wherever they are! Our process usually lasts 2-3 weeks and consists of a phone screen followed by Zoom interviews, including with senior leaders.

Compensation and Benefits: We are committed to the principle of equal pay for equal work. The cash compensation for this role is market-competitive, ranging from $150,000 to $200,000. Additional benefits include equity options, comprehensive health plans, retirement benefits, and unlimited paid time off (PTO).

Equal Employment Opportunity Statement: As an equal opportunity employer, Yugabyte is committed to a diverse workforce. Employment decisions regarding recruitment and selection will be made without discrimination based on race, color, religion, national origin, gender, age, sexual orientation, physical or mental disability, genetic information or characteristic, gender identity and expression, veteran status, or other non-job related characteristics or other prohibited grounds specified in applicable federal, state and local laws.

To review Yugabyte's Privacy Policy, please visit the Yugabyte Privacy Notice.

Job Profile

Regions

North America

Countries

United States

Benefits/Perks

  • Comprehensive health plans
  • Equity options
  • Health plans
  • Retirement benefits
  • Unlimited Paid Time Off

Skills

  • Cloud Technologies
  • Databases
  • Java
  • Kubernetes
  • Python
  • Shell scripting
  • Terraform

Tasks

  • Automate maintenance operations
  • Design and build DBaaS processes
  • Design, develop, test, troubleshoot, maintain cloud infrastructure
  • Establish incident response processes
  • Manage operational priorities
  • Utilize SRE golden signals for performance optimization

Timezones

  • America/Anchorage
  • America/Chicago
  • America/Denver
  • America/Los_Angeles
  • America/New_York
  • Pacific/Honolulu
  • UTC-10
  • UTC-5
  • UTC-6
  • UTC-7
  • UTC-8
  • UTC-9