
Senior Compute Cluster Deployment Engineer

NVIDIA

What you will be doing:

  • Manage and maintain AI/HPC infrastructure in Linux-based environments for new and existing customers.

  • Support operational and reliability aspects of large-scale AI clusters, with a focus on performance at scale, real-time monitoring, logging and alerting.

  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.

  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

  • Provide feedback to internal teams by opening bugs, documenting workarounds, and suggesting improvements.

  • Be part of an on-call rotation to support production systems.


What we need to see:

  • 5+ years providing in-depth support and deployment services, solving problems for hardware and software products.

  • Knowledge of and experience with Linux system administration: process management, package management, task scheduling, kernel management, boot procedures and troubleshooting, performance reporting, optimization and logging, and network routing and advanced networking (tuning and monitoring).

  • Experience with cluster management technologies, e.g. Bright Cluster Manager.

  • Scripting proficiency.

  • Good social skills, with the ability to manage and deliver resolutions for customer-blocking issues as they arise.

  • Superb communication and oral presentation skills.

  • Excellent verbal and written English skills.

  • Strong organizational skills and the ability to prioritize and multitask with limited supervision.

  • Candidates should have a minimum of a four-year degree from an accredited university or college in Computer Science, or Electrical or Computer Engineering.

  • Industry-standard Linux certifications.


Ways to stand out from the crowd:

  • InfiniBand experience.

  • Experience with GPU focused hardware/software.

  • Experience with MPI.

  • Automation tooling background (Ansible, Salt, Puppet, etc.).

  • Experience with Ethernet and storage technologies.
