Site Reliability Engineer
Remote
About Thalamus
Our mission is to help the right doctors practice at the right hospitals to treat the right patients. We leverage a passion for technology, medical education, equity, and data-driven research to optimize physician recruitment, starting with the medical residency recruitment process.
Our philosophy is that the opportunity to practice medicine in an ideal environment should be accessible to all, and ample medical research has shown that this results in patients getting better healthcare outcomes overall. We built a comprehensive interview management platform, backed by evidence-based research, to innovate, streamline, and optimize the residency recruitment process.
At Thalamus, our SRE will lead our cloud infrastructure transformation initiatives. In this role, you will be responsible for architecting, implementing, and optimizing our reliability strategy across all platforms, with a focus on driving our cloud infrastructure modernization efforts. The successful candidate will lead cross-functional teams to design and implement observability solutions, establish automated infrastructure provisioning, and create consistent environments leveraging Kubernetes.
You will...
- Technical Leadership: Provide expert technical guidance for cloud infrastructure, observability, and reliability engineering practices
- Architecture Design: Design and implement a scalable, resilient cloud architecture leveraging Kubernetes ecosystem technologies
- Observability Strategy: Lead the implementation of comprehensive monitoring and telemetry solutions to provide visibility across the entire technology stack
- Automation Excellence: Champion infrastructure-as-code methodologies and implement repeatable, automated deployment patterns
- Disaster Recovery: Develop and improve business continuity/disaster recovery strategies and solutions
- Team Leadership: Mentor Staff SREs and other engineers on cloud-native technologies and reliability best practices
- Cross-team Collaboration: Partner with development teams to establish and maintain effective Service Level Objectives
You should have...
- 8+ years of experience in infrastructure engineering, with at least 3 years in a senior leadership position
- 10+ years of AWS experience and 5+ years of Azure experience
- Deep expertise with Kubernetes orchestration and ecosystem technologies
- Extensive experience implementing observability solutions (metrics, logging, tracing, alerting)
- Strong background in infrastructure automation using Terraform, Helm, or equivalent tools
- Experience architecting high-availability systems in cloud environments
- Track record of leading significant infrastructure initiatives and driving architectural decisions
- Exceptional communication skills with the ability to explain complex technical concepts to diverse audiences
Bonus
- Experience with multi-cloud and hybrid cloud architectures
- Knowledge of DataDog or Prometheus observability stacks
- Experience migrating workloads from traditional platforms to Kubernetes
- Background implementing GitOps workflows for infrastructure and application deployment
- Knowledge of service mesh technologies (Istio, Linkerd, etc.)
- Experience implementing zero-trust security models in cloud environments
The salary range for this position is $200,000 - $250,000, plus a grant of stock options. Final compensation will be determined based on experience, skills, and geographic location.
Our Commitment ...
Thalamus is a mission-driven organization centered on the belief that our company should model what we want of the US healthcare system, namely that the diversity of providers aligns with patient populations. We believe this is best achieved by building a team with a diversity of backgrounds, cultures, and experiences, including “distance traveled.” Thalamus is an equal opportunity employer. We do not discriminate based upon race, religious creed, color, national origin, ancestry, physical or mental disability, medical condition, genetic information, marital status (including registered domestic partnership status), sex and gender (including pregnancy, childbirth, lactation, and related medical conditions), gender identity and gender expression (including transgender individuals who are transitioning, have transitioned, or are perceived to be transitioning to the gender with which they identify), age, sexual orientation, Civil Air Patrol status, military and veteran status, and any other consideration protected by federal, state, or local law. We encourage those who really want to make an impact and who exemplify our core values to apply for our open positions.
Actual base salary offered will be determined by experience, skills, and work location. This range is for base salary; our total compensation includes equity and benefits. We welcome you to apply even if your expectations are outside our listed range.
Thalamus is committed to providing reasonable accommodations for qualified individuals with disabilities in our job application procedures and throughout employment. If you need assistance or any accommodation, please let us know.
Thalamus does not accept unsolicited resumes from recruiters or employment agencies without a fully executed recruitment agreement in place. In the absence of such agreement, Thalamus reserves the right to pursue and hire any candidates without an obligation to pay fees. Agencies are requested not to contact Thalamus hiring managers or employees regarding recruiting services.
*This position is based in the United States, and you must be legally authorized to work in the United States.
Job Profile
Tasks
- Architect and implement reliability strategy
- Champion infrastructure-as-code
- Collaborate with development teams
- Design scalable cloud architecture
- Develop disaster recovery strategies
- Implement observability solutions
- Lead cloud infrastructure transformation
- Mentor engineers
AWS, Azure, Cloud Infrastructure, GitOps, Helm, High-Availability Systems, Infrastructure Automation, Kubernetes, Monitoring, Observability, Reliability Engineering, Service Mesh, Telemetry, Terraform, Zero Trust Security
Experience: 8 years