AI DevOps Support Engineer - REMOTE
Atlanta, GA, US
Req ID: 316845
NTT DATA strives to hire exceptional, innovative and passionate individuals who want to grow with us. If you want to be part of an inclusive, adaptable, and forward-thinking organization, apply now.
We are currently seeking an AI DevOps Support Engineer - REMOTE to join our team in Atlanta, Georgia (US-GA), United States (US).
As AI Platform Specialists, the people in these roles will provide application and GPU support. The team will deliver Tier 1 and Tier 2 support to developers and engineers while collaborating closely with Tier 3 and Tier 4 platform teams and vendors on issue resolution. The roles require user-level knowledge of Kubernetes, virtualization, and cloud-native technologies, as well as operator-level knowledge of GPUs and other AI supporting services. Each specialist should focus on customer service while working toward reliability, scalability, and performance goals.
Day to Day Responsibilities:
- Platform Support & Incident Response
  - Provide Tier 1 & Tier 2 support for AI-driven applications and workloads.
  - Troubleshoot and resolve issues related to Kubernetes deployments, GPU utilization, and service performance.
  - Collaborate with Tier 3+ teams, including Kubernetes engineers and external vendors, to escalate and resolve complex issues.
- Kubernetes & Cloud-Native Operations
  - Adopt, build, and integrate automated services using Helm, Ansible, Terraform, and similar tools.
  - Deploy, manage, and support containerized AI workloads on Google Anthos-powered Kubernetes clusters.
  - Ensure adherence to pod security policies, automated rollouts/rollbacks, and best practices for scalable and secure Kubernetes environments.
- GPU Infrastructure & AI Services Management
  - Optimize and support GPU-enabled workloads, including CUDA and other AI acceleration frameworks.
  - Assist in the installation, configuration, and support of AI coding assistants (e.g., Codeium).
- Observability & Documentation
  - Maintain detailed operational documentation, runbooks, and troubleshooting guides.
  - Use monitoring and logging tools such as New Relic, Big Panda, Prometheus, Grafana, and other observability frameworks.
- Process Improvement & Collaboration
  - Work cross-functionally with developers, IT teams, and vendors to ensure seamless deployment and support of AI services.
  - Contribute to CI/CD pipelines, automation, service, and security best practices.
  - Track and communicate work through task management platforms (ServiceNow and Jira).
Minimum Requirements:
- 5+ years with hybrid Cloud – In-depth knowledge of private (on-premises) and public (GCP & AWS) cloud architectures and services.
- 5+ years developer experience with …
Job Profile
Remote role
Benefits/Perks:
- Inclusive work environment
- Opportunities for growth
- Remote work flexibility
Tasks:
- Collaborate
- Collaborate with tier 3+ teams
- Communication
- Configuration
- Documentation
- Implementation
- Maintain operational documentation
- Optimize GPU workloads
- Process Improvement
- Provide tier 1 & tier 2 support
- Resolve issues
- Support
- Troubleshooting
- Work cross-functionally with teams
AI, AI acceleration frameworks, Ansible, Applications, Artificial Intelligence, Automation, AWS, Big Panda, CI/CD, Click, Cloud, Cloud Native technologies, Coding, Collaboration, Communication, Consulting, CUDA, Customer service, DevOps, Documentation, GCP, Git, GPU support, Grafana, Helm, Implementation, Incident Response, Integrations, Jenkins, Jira, Jupyter notebooks, Kubernetes, Logging, Monitoring, New Relic, Process Improvement, Prometheus, Python, PyTorch, R, Reporting, Security, Security Best Practices, ServiceNow, Teams, Technical Support, TensorFlow, Terraform, Time Management, Troubleshooting
Experience: 5 years
Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9