Staff Software Engineer - Infrastructure Monitoring

Boston, MA, United States; New York, NY, United States; Remote

Datadog is seeking an experienced Staff Engineer with deep GPU experience (development and operations) to join our Infrastructure Monitoring team and help build out GPU-specific observability capabilities in our Infrastructure Monitoring products. This role will directly shape Datadog’s approach to building observability tooling for customers who use GPUs in their infrastructure. Example problems this person will solve include “How can we detect runtime issues over a fleet of GPUs, isolate the root cause, and provide actionable recommendations to resolve the issue?” and “How can we profile and optimize software running on GPUs?” The work involves significant cross-team collaboration with a number of Datadog product and platform teams and requires the ability to go deep across many different product stacks.

What You'll Do:

  • Develop a company-wide approach to GPU observability across the three pillars: metrics, logs, and traces
  • Collaborate with cross-functional teams to design and develop GPU-centric product offerings
  • Drive high-priority, high-visibility products that expand Datadog’s penetration into the GPU market
  • Lead architectural decisions for new and existing GPU-based observability products
  • Identify opportunities for Datadog product enhancements to provide coverage for GPUs
  • Contribute to short- and long-term planning and roadmap development

Who You Are:

  • You have several years of experience leading cross-team initiatives in a platform or infrastructure-focused environment
  • You have a deep understanding of GPUs, and have both developed for and operated them in production environments
  • You are deeply familiar with at least one of the following areas: Data Science, Graphics Programming, Large Language Models
  • You have significant back-end programming experience and have architected, built, and operated distributed systems to solve problems at high scale
  • You possess a deep understanding of the day-to-day responsibilities of an engineer and have a strong technical background
  • You have excellent verbal and written communication skills and are comfortable presenting and defending your ideas to both technical and non-technical audiences
  • You have a BS/MS/PhD in Computer Science, Engineering, or a related scientific field, or equivalent experience

Datadog offers a competitive salary and equity package, which may include variable compensation. Actual compensation is based on factors such as the candidate's skills, qualifications, and experience. In addition, Datadog offers a wide range of best-in-class, comprehensive, and inclusive employee benefits for this role, including healthcare, dental, parental planning, and mental …