Incident Manager - Distributed US
Remote, United States
Lots of tech companies disrupt. But many fail when they try to scale. We're different. CockroachDB makes it easier for companies to build and scale apps. This is how and why we're helping some of the most innovative companies on the planet. We tackle problems head-on and focus on solutions that create lasting impact.
Because when our customers win, we all win.
The Role
As an Incident Manager at Cockroach Labs, you will oversee the resolution of all types of incidents across internal, hosted cloud, on-premises customer environments, and security/compliance areas. Your responsibilities will include owning incident escalations, documenting processes, maintaining clear communication with customers and stakeholders, and collaborating with cross-functional teams to identify root causes and implement strategies to prevent future incidents. As the founding Incident Manager, you will play a crucial role in shaping the future of Incident Management at Cockroach Labs. You will:
- Manage the full lifecycle of incidents from identification through resolution, ensuring adherence to established incident management protocols across various mediums including cloud-hosted and fleet-wide incidents, customer-hosted cluster incidents, and security incidents.
- Lead and coordinate response efforts across various teams to ensure timely and effective incident resolution.
- Act as an escalation point for critical incidents and assist in leading crisis response processes as required.
- Drive root cause investigations for high impact/high visibility issues.
- Manage communications about incident status, impact, and resolution progress, tailored to both technical and non-technical audiences, including internal and external (customer-facing) stakeholders.
- Conduct post-incident reviews with cross-functional teams, identifying actionable insights and process optimizations.
- Monitor, evaluate, and report on incident management programs, identifying trends and areas for improvement.
- Assist in the design and implementation of new processes and procedures to handle business growth and maturation.
- Provide rotational on-call support (24x7x365) to ensure incidents are handled promptly and effectively.
The Expectations
In your first 30 days, you will familiarize yourself with CockroachDB, our customers, and our company. We will provide self-guided onboarding, with reading and hands-on material, to introduce you to the company and some of the responsibilities of the role. During this period, you will also start to get acquainted with our incident management protocols and tools, and begin shadowing incident response activities to observe and learn from other team members with an eye to future improvements and optimizations.
After 60 days, you will be integrated into the company and will be familiar with the various systems we use. You will be able to manage incidents from both internal and customer environments and will be actively contributing to the Incident Management program. You will start leading incident response efforts, conducting root cause analyses, and participating in post-incident reviews. Additionally, you will begin to assist in refining and optimizing our incident management processes and documentation. You will be assisting management in planning team expansion and scale.
You Have:
- Bachelor’s degree in Computer Science, Information Technology, a related field, or equivalent work experience.
- 2+ years of experience in Incident Management, including leadership of high-severity incidents.
- 7+ years of experience in a technical role.
- Proficiency in troubleshooting techniques and problem-solving in a global 24x7x365 environment.
- Strong analytical and problem-solving skills, with the ability to conduct thorough root cause analysis.
- Excellent verbal and written communication skills, with the ability to convey complex information clearly to both technical and non-technical stakeholders.
- Scripting skills in Bash, JavaScript, Python, or equivalent languages, with the ability to develop scripts and tools to enhance problem management processes.
- Willingness to be flexible with working hours depending on the needs of the business.
Preferred Qualifications:
- Proven ability to lead incident response calls confidently, driving toward resolution and minimizing downtime.
- Strong interpersonal and influencing skills to collaborate effectively across teams without direct authority.
- Strong understanding of IT service management principles and incident management best practices.
- Experience with Incident Management software.
- Familiarity with leading investigations in an enterprise environment.
- Regulatory clearance to work in a regulated environment, with a strong understanding of compliance requirements and adherence to regulatory standards.
- Experience with security and compliance related incident response.
- Working knowledge and applied skills in ITIL, Change, Incident and Problem Management.
- ITIL certification
- Technical certifications
Cockroach Labs is proud to be an Equal Opportunity Employer building a diverse and inclusive workforce. If you need additional accommodations to feel comfortable during your interview process, please email us at accessibility@cockroachlabs.com.
Cockroach Labs has a hybrid work model, with Roachers who are local to one of our offices coming in on Mondays, Tuesdays, and Thursdays and working flexibly the rest of the week. While we’ve learned valuable lessons working remotely, nothing can replace the connection, creativity, and fun that occurs when Roachers get together, and we are committed to fostering a workplace that encourages collaboration and allows us all to do our best work.
Benefits
- Stock Options
- Medical Insurance
- Vision Insurance
- Dental Insurance
- Life and Disability Insurance
- Professional Development Funds
- Flexible Time Off
- Paid Holidays
- Paid Sick Days
- Paid Parental Leave
- 401(k) Plan
- Mental Wellbeing Benefits
- And more!
The annual anticipated base salary range for U.S. candidates for this role is listed in USD below. Salary is one component of the Cockroach Labs Total Rewards package, which also includes, for each employee: stock options, medical insurance, vision insurance, dental insurance, life and disability insurance, funds towards professional development resources, flexible paid time off, 11 paid holidays a year, 10 paid sick days a year, paid parental leave, a 401(k) plan, and wellbeing benefits.
We set standard ranges for all U.S.-based roles based on function, level, and geographic location, benchmarked against similar-stage growth companies. Actual salaries may vary and fall outside of this range depending on factors such as a candidate’s qualifications, geographic location, skills, experience, and competencies. In addition, we are often open to a wide variety of profiles, and recognize that the person we hire may be less experienced (or more senior) than this job description as posted.
Salaries for candidates outside the U.S. will vary based on local compensation structures.
This position will remain posted until filled. Applicants should apply via our Careers Page.
Annual Anticipated Base Salary Range (U.S.): $116,000 to $154,000 USD
Job Profile
Career development opportunities, Dental Insurance, Flexible hours, Flexible time off, Hybrid work, Hybrid work model, Life and Disability insurance, Medical Insurance, Mental wellbeing benefits, Paid holidays, Paid parental leave, Paid Sick Days, Professional development, Professional development funds, Remote work, Stock options, Vision Insurance
Tasks
- Conduct post-incident reviews
- Lead response efforts
- Manage incident lifecycle
- Monitor incident management programs
- Oversee incident resolution
CockroachDB, Communication, Cross-functional Collaboration, Documentation, Incident Management, Problem-solving, Process Optimization, Python, Root Cause Analysis, Troubleshooting
Experience: 7 years
Education: Bachelor's degree, Computer Science, Information Technology
Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9