Site Reliability Engineer II
Remote - USA
About The Role
Enterprises of all sizes trust Abnormal Security’s cloud products to stop cybercrime. These products must scale with our customers’ growth and stay reliable and available through resilience. This is where our SRE fits in: preventing, detecting, efficiently remediating, and quickly recovering from outages that impact the Abnormal Security Platform.
Come empower the rest of engineering to stop cybercrime as we expand our offerings across both clouds and regions.
There are a lot of opportunities for growth and career advancement – it’s up to you to own your career here. Some potential career paths for this role include:
- Positioning yourself to be a founding member of a team that will have an outsized impact on the rest of the company.
- Growing into a Senior technical leadership role.
What You Will Do
- Deployment Operations
  - Build tools and processes to standardize deployment of the Abnormal Security product suite in a multi-datacenter setup.
  - Partner with R&D teams to develop pre- and post-deployment checklists, canary test environments and workflows, and safe rollback processes.
- Incident Prevention
  - Identify gaps in existing processes and advocate for necessary changes to improve overall system stability and availability.
  - Lead the Production Readiness Review process to ensure the resilience of systems before customer deployment.
  - Oversee the Critical Change Management Review process for the safe application of changes to critical services.
  - Develop and enforce architecture guidelines to minimize downtime and ensure high system availability.
- Detection
  - Establish a consistent definition of the metrics that answer “Is this product working?”
  - Define and monitor SLAs/SLOs for critical systems, actively tracking deviations and triggering alerts when necessary.
- Remediation
  - Define incident severity classification guidelines and implement incident response protocols to promptly address issues and reduce downtime.
  - Facilitate effective communication between Engineering and Customer Success teams during incidents.
- Incident Recovery
  - Design and implement tools to expedite system recovery and minimize the impact of incidents.
  - Develop guidelines for postmortems after incidents to prevent recurrence.
Must Have
- Bachelor’s in Computer Science, Computer Engineering, or equivalent professional experience
- 1+ years of experience as a Site Reliability Engineer, responsible for the reliability of shared services
- Experience with a public cloud provider (AWS, Azure, GCP), observability stack (Prometheus, …
Job Profile
- Location: Remote - USA
- Benefits/Perks: Bonus eligibility, career growth opportunities, individual compensation packages, comprehensive benefits, equity philosophy, Restricted Stock Units (RSUs)
- Tasks: Identify process gaps
- Skills: AWS, Azure, Change Management, Cloud, Cloud Computing, Communication, Customer Success, GCP, Grafana, Helm, Incident Management, Infrastructure, Infrastructure as Code, Kubernetes, PagerDuty, Prometheus, Security, Sentry, Site Reliability Engineering, Slack, Terraform
- Experience: 1 year
- Education: Bachelor’s in Computer Science or Computer Engineering, or equivalent professional experience
- Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu (UTC-10, UTC-9, UTC-8, UTC-7, UTC-6, UTC-5)