Senior Site Reliability Engineer

Remote - USA - OR

Lytx

USD 132K+ Full Time Senior

Published 3 months ago

Why Lytx:

We are a team of hungry, low-ego, and capable engineers who design and support our IoT infrastructure. We are growing rapidly and migrating to the cloud! Are you interested in "Operations as Code", "Infrastructure as Code", and infrastructure automation solutions? If so, keep reading.

The Site Reliability Engineering team is responsible for the availability, reliability, observability, and resilience of the infrastructure and related automation across the entire fleet of on-prem servers and the organization's expanding cloud footprint. The team’s responsibilities are critical to the organization's business continuity. If you love crafting new solutions and building scalable cloud and on-prem infrastructure, then this role may be an excellent match for you!

Responsibilities:

Build tools and frameworks to monitor systems and ensure the highest level of uptime in production environments.
Participate in and improve our 24/7 on-call and incident management process. Build and maintain runbooks. Contribute to the design and documentation of cloud services and SOPs.
Work closely with Architects, DBAs, Developers, DevOps, and Data Engineers from design to production while building reliable, scalable, and cost-optimized services.
Collaborate with Service Owners to define SLOs and build SLIs to ensure systems meet their SLAs.
Participate in blameless post-mortems. Assist in publishing RCA documents for internal and external consumption.
Reduce operational toil and maintain a high degree of automation by adopting IaC-first and GitOps principles.
Acquire and maintain a significant understanding of Lytx production services to ensure timely resolution of production incidents.

Requirements:

5+ years of experience as an SRE in an AWS environment at a medium- to large-scale organization.
5+ years of hands-on experience implementing and managing observability tools (Prometheus, New Relic, Grafana, etc.)
High level of programming proficiency, preferably using Python, Groovy, and Bash.
Good understanding of database technologies (SQL and NoSQL)
3+ years of experience building infrastructure deployment pipelines leveraging Git, Terraform, Helm, Jenkins/Jenkins X/Argo CD, etc.
Proven experience in designing production environments in the AWS cloud using various AWS services (VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connect, Route 53, RDS, ALBs, Auto Scaling, etc.)
Hands-on experience with Linux systems and various protocols and technologies (HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, load balancing, SQL/NoSQL, message brokers, Nginx, Vault, etc.)
Hands-on experience with Kubernetes and various container and cloud native technologies.
Significant experience in participating in, implementing, and managing a 24/7 on-call rotation for an SRE team, creating …