Service Reliability Operations Administrator
US, CA, Remote
NVIDIA's NGC team is looking for highly motivated System Administrator/DevOps engineers to design, develop and implement a global, dynamic, innovative Service Reliability Operations Center (known as Mission Control), to provide extraordinary levels of support for our Cloud products and services. As a key member of the Mission Control team, you will partner with other key members of our organization including Site Reliability Engineering, Security Operations Center, DevOps teams, and other partners to help make our services capable of providing near 100% availability.
On the rare occasion that an incident occurs, you will be our front line to decrease the frequency and duration of any issue. Working in partnership with the development community, the Mission Control team will develop monitors, alarms, and alerts to help make the service more reliable and improve our customer experience. Additionally, you will be closely involved in selecting the technologies that we will use in Mission Control to help monitor, run, and measure the effectiveness of the environment.
What you will be doing:
The team will provide its services 24/7 in a follow-the-sun model that spans continents.
You will report directly to a manager in the United States.
Each team member will need to work either a Saturday or a Sunday each week. The hours worked may include an early or late start (a schedule of 10 hours per day, 4 days per week) to ensure that the US and India teams together provide 24/7 coverage.
The heart of Mission Control will be monitoring and running a growing production compute and storage environment.
Every Mission Control team member will use alerts and alarms to help prevent issues and incidents when possible. You may also work with the developer community to develop and implement predictive support or diagnostic routines.
Perform systems administration tasks, network administration tasks, and security incident monitoring to drive our actions.
Mission Control team members will work with developers to learn how the service works, then translate that understanding into runbooks which the entire team will use. As new features and functionality are added, you will also update and evolve the runbooks as needed.
Help discover incidents and issues, including initiating the incident management procedure.
Bring in subject matter authorities or service owners as needed to resolve issues. Feedback will help us continually improve our service.
Your interpersonal skills will help keep the team engaged through resolution and ensure our clients believe we value their time and effort.
May …
Job Profile
Benefits, Diversity, Eligible for Equity, Equity, Equity and benefits, Flexible hours, Remote work, Work environment
Tasks:
- Incident management
- Monitor production environments
Analytical, Ansible, Automation, Cloud products, Cloud Services, Compute, Container Orchestration, Containers, DevOps, DHCP, DNS, Engineering, Git, Incident Management, Interpersonal, Monitoring, Monitoring tools, Networking, NVIDIA, Operations, Orchestration, Problem-solving, Python, Scripting, Security, Security Operations, Servers, Shell scripting, Site Reliability Engineering, Storage, System Administration, Troubleshooting, Virtual Machines
Experience: 5 years
Education: B.S., Engineering, Equivalent, Equivalent experience, Operations
Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9