Accessing Clusters

NVIDIA uses several internal Slurm clusters to launch AI workloads. To join a cluster, use the onboarding portal (https://csrg.nvidia.com/onboarding/requests) as follows:

  1. Go to https://itss.nvidia.com/#/accounts/unix and create a Unix username, if you do not already have one.

  2. After the username is created, go to https://csrg.nvidia.com/onboarding/user/requests

  3. Search for your PPP and request access.

  4. Once access is approved by the PIC, you will receive login node details.

  • Head Node Access: Access to the head node must be requested in the support Slack channel for the cluster. Once access is granted, SSH into the head node and load the Kubernetes and Slurm modules to access the local cluster.

  • Login Node Access: SSH into the login node using the login node details you received when access was approved. Users submit jobs to the cluster from this node through the Slurm scheduler.

    $ ssh <unix username>@<login node>
    
  • Worker Node Access:

    From the login node, you can SSH to worker nodes by name. Note that SSH access to worker nodes may be restricted.

    For example:

    $ ssh worker-2
    
  • Kubernetes Access: Use kubectl commands to manage the Kubernetes cluster where ARE is installed.

    Example of ARE pods running in the cluster:

    $ kubectl get pods -n heimdall
    
    heimdall-anomaly-detection-isolator-7c5bdc8fd4-kqxzj          1/1     Running           0          11d
    heimdall-anomaly-remediation-manager-fd94f9db8-g2ptp          1/1     Running           0          11d
    heimdall-attribution-service-68958848bd-45hxr                 1/1     Running           0          11d
    heimdall-dashboard-7556b6fcfb-9jsvr                           1/1     Running           0          11d
    heimdall-grafana-64546b66d7-h227j                             1/1     Running           0          11d
    heimdall-job-manager-764c8d8b87-748mw                         1/1     Running           2 (11d ago)    11d
    heimdall-job-manager-764c8d8b87-9qtgc                         1/1     Running           2 (11d ago)    11d
    heimdall-job-manager-764c8d8b87-ddbn7                         1/1     Running           1 (11d ago)    11d
    heimdall-job-manager-764c8d8b87-kkdtb                         1/1     Running           2 (11d ago)    11d
    heimdall-job-manager-764c8d8b87-tpq7l                         1/1     Running           2 (11d ago)    11d
    heimdall-kpis-service-7c54886fb4-cmdp6                        1/1     Running           0          11d
    heimdall-mysql-0                                              1/1     Running           0          11d
    heimdall-notifications-66cfd95776-fftpb                       1/1     Running           0          11d
    heimdall-rabbitmq-0                                           1/1     Running           0          11d
    heimdall-watcher-847c68bd56-xr6k2                             1/1     Running           0          7d21h
    heimdall-workflow-controller-86656fcff8-t9xgk                 3/3     Running           24 (17h ago)   11d
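
A listing like the one above can also be checked mechanically. The following is a minimal sketch, not part of ARE or kubectl: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE). It reads pod rows on standard input and prints the name of every pod that is not Running or whose READY count is incomplete.

```shell
#!/bin/sh
# Hypothetical helper (an illustration, not an ARE or kubectl feature).
# Reads `kubectl get pods` rows on stdin and prints the names of pods
# that are not in the Running state or are not fully ready.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '
    $1 == "NAME" { next }   # skip the header row if it is present
    NF < 3       { next }   # skip blank or truncated lines
    {
      # READY has the form "ready/total", for example "3/3"
      split($2, ready, "/")
      if ($3 != "Running" || ready[1] != ready[2]) {
        print $1
      }
    }
  '
}
```

On a host with kubectl access, this might be used as kubectl get pods -n heimdall --no-headers | unhealthy_pods. Empty output means every pod is Running and fully ready, which is what the listing above would produce. Pods in other states, including Completed pods left behind by finished jobs, are reported, so the condition may need adjusting for a given cluster.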