Senior Site Reliability Engineer - GeForce Now

US, CA, Remote

NVIDIA

Published 5 months ago

NVIDIA has been redefining computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s motivated by outstanding technology and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, encouraging environment where everyone is inspired to do their best work.

NVIDIA is looking for a Senior Site Reliability Engineer (SRE) to join its cloud service team and support, triage, and build generative AI-powered visual applications. SREs are responsible for the big picture of how our systems relate to each other, and we use a breadth of tools and approaches to solve a broad spectrum of problems. We live SRE practices that are key to product quality, such as limiting time spent on reactive operational work, blameless postmortems, proactive identification of potential outages, and iterative improvements, all of which make for exciting and multi-faceted day-to-day work. The person in this position will be responsible for Service Response and workflow and will drive tools/service development to maintain and improve service SLOs. We partner with Service Owners to drive the reliability of the service.

What you will be doing:

Support and work on groundbreaking Generative AI inferencing and training workloads running in a globally-distributed heterogeneous environment that spans all major cloud service providers. Ensure the best possible performance and availability on current and next-generation GPU architectures.
Collaborate closely with the service owner, architecture, research, and tools teams at NVIDIA to achieve ideal results for the AI problems at hand.
Monitor and support critical high-performance, large-scale services running across multiple clouds.
Participate in the triage & resolution of sophisticated infra-related issues.
Maintain services once live by measuring and monitoring availability, latency, and overall system health using metrics, logs, and traces.
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
Practice balanced incident response and blameless postmortems.
Be part of an on-call rotation to support production systems.
Lead significant production improvement around tooling, automation, and process.
Architect, design, and code using your expertise to optimize, deploy and productize services.

What we need to see:

8+ years of demonstrated experience operating & owning end-to-end availability and performance of critically important services in a live-site production environment, either as an SRE or …

Job Profile

Benefits/Perks

Benefits, Competitive salaries, Diverse environment, Diversity, Eligible for Equity, Equity, Equity and benefits, Innovative projects

Tasks

Training

Skills

Accelerated Computing, AI, Analytical, Automation, AWS, Azure, Cloud Services, Communication, Computer graphics, Containerization, CUDA, Deep Learning, ELK stack, GCP, Generative AI, Go, GPU, Incident Management, Kubernetes, Machine Learning, Microservices, NVIDIA, Performance monitoring, Presentation, Prometheus, Python, PyTorch, Site Reliability Engineering, TensorFlow, Training

Experience

8 years

Education

Bachelor's, Equivalent experience

Timezones

UTC-8
