Accessing Clusters
NVIDIA runs its internal AI workloads on several Slurm clusters. To join a cluster, use the onboarding portal at https://csrg.nvidia.com/onboarding/requests
First, create a Unix username if you haven’t already done so by going to https://itss.nvidia.com/#/accounts/unix
After the username is created, go to https://csrg.nvidia.com/onboarding/user/requests
Search for your PPP and request access.
Once access is approved by the PIC, you will receive login node details.
Head Node Access: Request access to the head node in the support Slack channel for the cluster. Once you have the proper access rights, SSH into the head node and load the Kubernetes and Slurm modules to access the local cluster.
Login Node Access: SSH into the login node with proper access rights. This is the node from which users submit jobs to the cluster through the Slurm scheduler.
$ ssh <unix username>@<login node>
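Because the login node is where jobs are submitted to Slurm, a first test job can be sketched as follows. This is a minimal illustration, not a cluster-specific recipe: the file name hello_job.sh, the job name, the node count, and the time limit are placeholder assumptions, and some clusters may also require an account or partition option.

```shell
# Write a minimal batch script. All values below are placeholders chosen
# for illustration, not requirements of any particular cluster.
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --time=00:05:00
srun hostname
EOF

# On the login node (requires Slurm to be available):
#   sinfo                 # list partitions and node states
#   sbatch hello_job.sh   # submit the script to the scheduler
#   squeue -u "$USER"     # check your queued and running jobs
```

The Slurm commands are left as comments so the snippet can be pasted anywhere without failing; run them only once you are logged in to the login node.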
Worker Node Access:
Once you are on the login node, you can SSH to the worker nodes using their names. Sometimes SSH access to worker nodes is restricted.
For example:
$ ssh worker-2
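Where worker SSH is permitted, the two hops (login node, then worker) can be combined with the OpenSSH -J (ProxyJump) option. The sketch below only prints the command so it can be inspected first. The helper name worker_ssh_cmd and the host name login-node.example are hypothetical, and it assumes the same Unix username is valid on both nodes.

```shell
# Hypothetical helper: print (do not run) an ssh command that hops through
# the login node to a worker, using OpenSSH's -J (ProxyJump) option.
worker_ssh_cmd() {
  user="$1"
  login_node="$2"
  worker="$3"
  printf 'ssh -J %s@%s %s@%s\n' "$user" "$login_node" "$user" "$worker"
}

# Example with placeholder values: inspect the output, then copy and run it.
worker_ssh_cmd "${USER:-alice}" "login-node.example" "worker-2"
```

With -J, the worker name is resolved from the login node, so short names such as worker-2 should behave the same way as in the manual two-step login.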
Kubernetes Access: Use kubectl commands to manage the Kubernetes cluster where ARE is installed. Example of ARE pods running in the cluster:
$ kubectl get pods -n heimdall
heimdall-anomaly-detection-isolator-7c5bdc8fd4-kqxzj   1/1   Running   0              11d
heimdall-anomaly-remediation-manager-fd94f9db8-g2ptp   1/1   Running   0              11d
heimdall-attribution-service-68958848bd-45hxr          1/1   Running   0              11d
heimdall-dashboard-7556b6fcfb-9jsvr                    1/1   Running   0              11d
heimdall-grafana-64546b66d7-h227j                      1/1   Running   0              11d
heimdall-job-manager-764c8d8b87-748mw                  1/1   Running   2 (11d ago)    11d
heimdall-job-manager-764c8d8b87-9qtgc                  1/1   Running   2 (11d ago)    11d
heimdall-job-manager-764c8d8b87-ddbn7                  1/1   Running   1 (11d ago)    11d
heimdall-job-manager-764c8d8b87-kkdtb                  1/1   Running   2 (11d ago)    11d
heimdall-job-manager-764c8d8b87-tpq7l                  1/1   Running   2 (11d ago)    11d
heimdall-kpis-service-7c54886fb4-cmdp6                 1/1   Running   0              11d
heimdall-mysql-0                                       1/1   Running   0              11d
heimdall-notifications-66cfd95776-fftpb                1/1   Running   0              11d
heimdall-rabbitmq-0                                    1/1   Running   0              11d
heimdall-watcher-847c68bd56-xr6k2                      1/1   Running   0              7d21h
heimdall-workflow-controller-86656fcff8-t9xgk          3/3   Running   24 (17h ago)   11d
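When scanning a listing like this, it can help to show only the pods that are not Running or that have restarted. The following is a small sketch of one way to do that with awk. The function name pods_needing_attention is hypothetical, and it assumes the default column order of kubectl get pods (name, ready, status, restarts, age) with the header suppressed through --no-headers.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print name, status, and restart count for every pod whose status is
# not "Running" or whose restart count is greater than zero.
pods_needing_attention() {
  awk '$3 != "Running" || $4 + 0 > 0 { print $1, $3, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n heimdall --no-headers | pods_needing_attention
```

Only the fourth field is read for the restart count, so entries such as "2 (11d ago)" are handled the same way as a plain "0".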