AI DevOps Support Engineer - REMOTE

Atlanta, GA, US

Req ID: 316845 

NTT DATA strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.

We are currently seeking an AI DevOps Support Engineer - REMOTE to join our team in Atlanta, Georgia (US-GA), United States (US).

These AI Platform Specialist roles provide application and GPU support. The team will deliver Tier 1 and Tier 2 support to developers and engineers while collaborating closely with Tier 3 and Tier 4 platform teams and vendors on issue resolution. The roles require user-level knowledge of Kubernetes, virtualization, and cloud-native technologies, as well as operator-level knowledge of GPUs and other AI supporting services. Each specialist should focus on customer service, with reliability, scalability, and performance as goals.

Day to Day Responsibilities:

  • Platform Support & Incident Response
    • Provide Tier 1 & Tier 2 support for AI-driven applications and workloads.
    • Troubleshoot and resolve issues related to Kubernetes deployments, GPU utilization, and service performance.
    • Collaborate with Tier 3+ teams, including Kubernetes engineers and external vendors, to escalate and resolve complex issues.
  • Kubernetes & Cloud-Native Operations
    • Adopt, create, and integrate automated services using Helm, Ansible, Terraform, and similar tools.
    • Deploy, manage, and support containerized AI workloads on Google Anthos-powered Kubernetes clusters.
    • Ensure adherence to pod security policies, automated rollouts/rollbacks, and best practices for scalable and secure Kubernetes environments.
  • GPU Infrastructure & AI Services Management
    • Optimize and support GPU-enabled workloads including CUDA and other AI acceleration frameworks.
    • Assist in the installation, configuration, and support of AI coding assistants (e.g., Codeium).
  • Observability & Documentation
    • Maintain detailed operational documentation, runbooks, and troubleshooting guides.
    • Use monitoring and logging tools such as New Relic, BigPanda, Prometheus, Grafana, and other observability frameworks.
  • Process Improvement & Collaboration
    • Work cross-functionally with developers, IT teams, and vendors to ensure seamless deployment and support of AI services.
    • Contribute to CI/CD pipelines, automation, and service and security best practices.
    • Track and communicate work through task management platforms (ServiceNow and Jira).

Minimum Requirements:

  • 5+ years with hybrid cloud: in-depth knowledge of private (on-premises) and public (GCP and AWS) cloud architectures and services.
  • 5+ years developer experience with …