Operations Engineer, HPC Network
Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA / Richmond, VA
CoreWeave is the AI Hyperscaler™, delivering a cloud platform of cutting-edge services powering the next wave of AI. Our technology provides enterprises and leading AI labs with the most performant, efficient and resilient solutions for accelerated computing. Since 2017, CoreWeave has operated a growing footprint of data centers covering every region of the US and across Europe. CoreWeave was ranked as one of the TIME100 most influential companies of 2024.
As the leader in the industry, we thrive in an environment where adaptability and resilience are key. Our culture offers career-defining opportunities for those who excel amid change and challenge. If you're someone who thrives in a dynamic environment, enjoys solving complex problems, and is eager to make a significant impact, CoreWeave is the place for you. Join us, and be part of a team solving some of the most exciting challenges in the industry.
CoreWeave powers the creation and delivery of the intelligence that drives innovation.
About the Role
At CoreWeave, we are seeking a dedicated and detail-oriented Operations Engineer to join our HPC Networking Team. HPC Networking at CoreWeave is tasked with developing and operating some of the largest InfiniBand fabrics, powering industry-leading AI workloads.
What You'll Do
In this role, you will support the deployment, monitoring, troubleshooting, and maintenance of large-scale InfiniBand fabrics, ensuring their stability and performance. The ideal candidate will have a strong operations mindset, effective collaboration skills, and the ability to solve complex issues in a dynamic environment.
- Regularly monitor the performance and health of InfiniBand fabrics, including switches, host adapters, and nodes.
- Investigate and resolve operational issues within InfiniBand fabrics, such as network connectivity problems and performance bottlenecks.
- Assist with the installation and operational bring-up of large InfiniBand fabrics in collaboration with onsite personnel and customer teams.
- Perform routine maintenance and upgrades on InfiniBand switches and control plane components.
- Collaborate with HPC cluster operations teams to provide troubleshooting and operational expertise.
Investing in our people is one of our top priorities, and we value candidates who can bring their diversified experiences to our teams. Here are some qualities we've found compatible with our team. We'd love to talk about whether this aligns with your experience and interests and what you're excited to work on next.
Who You Are
Minimum Qualifications
- At least 1 year …
Job Profile
Hybrid workplace
Benefits/Perks: Career defining opportunities, Catered lunch, Collaborative environment, Competitive salary, Disability Insurance, Dynamic environment, Dynamic work environment, Family-forming support, Flexibility, Flexible PTO, Flexible Spending Account, Health savings account, Hybrid work, Hybrid workplace, Investment in people, Life Insurance, Mental wellness benefits, Onboarding training, Paid parental leave, Remote work, Significant impact, Tuition reimbursement, Vision Insurance, Work-life balance
Tasks
- Collaborate with teams
- Monitoring
- Support
- Troubleshooting
- Troubleshoot operational issues
AI, Ansible, Automation, Bash, Benefits, Best Practices, Cloud, Collaboration, Compensation, Data center, Data Center Operations, Data centers, Excel, Grafana, HPC, HPC networking, Infiniband, Innovation, Linux, Linux System Administration, Management, Monitoring, Networking, Networking concepts, Next, NVIDIA, Onboarding, Operations, Prometheus, Python, Scripting, SLURM, System Administration, Troubleshooting
Experience: 1 year
Timezones: America/Anchorage, America/Chicago, America/Denver, America/Los_Angeles, America/New_York, Pacific/Honolulu, UTC-10, UTC-4, UTC-5, UTC-6, UTC-7, UTC-8, UTC-9