Using NVSM#
For maintenance tasks or health analysis, you can run NVSM remotely inside the containers deployed on the DGX worker nodes.
Here is the general NVSM workflow:
List all NVSM pod instances and the corresponding worker nodes to find the pod name associated with a specific DGX.
oc get pods -n nvidia-nvsm -o wide

NAME                READY   STATUS    RESTARTS   AGE   IP            NODE       ...
nvidia-nvsm-d9d9t   1/1     Running   1          8h    10.128.2.11   worker-0   ...
nvidia-nvsm-tt8g5   1/1     Running   1          8h    10.131.0.11   worker-1   ...
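If you already know which worker node you want to reach, you can filter the listing to that node instead of scanning the full table. This is a sketch using the standard `--field-selector` and jsonpath options of `oc`; the `nvidia-nvsm` namespace and the `worker-0` node name are taken from the example output above.

```shell
# Sketch: capture the name of the NVSM pod scheduled on a given worker node.
# The field selector and jsonpath output format are standard oc/kubectl options.
node=worker-0
pod=$(oc get pods -n nvidia-nvsm \
      --field-selector spec.nodeName="$node" \
      -o jsonpath='{.items[0].metadata.name}')
echo "$pod"
```

The captured name can then be used directly in the `oc exec` command in the next step.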
Use the oc exec command to start an interactive shell in the container that is running on that system.

oc exec -it <pod-name> -n nvidia-nvsm -- /bin/bash
You can now use one of the following main NVSM commands.
Note
When you execute NVSM, it can take a couple of minutes to collect system information.
To print the software and firmware versions of the DGX system:
nvsm show version
To provide a summary of the system health:
nvsm show health
To create a snapshot of the system components for offline analysis and diagnosis:
nvsm dump health
This command generates a tar file in the /tmp directory (see Retrieving Health Information for more information).
Exit the interactive shell:

exit
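To check several systems in a row, you do not necessarily need an interactive shell for each one. This sketch assumes nvsm can be invoked directly through `oc exec` (a standard way to run a single command in a pod) and iterates over all NVSM pods in the namespace shown above.

```shell
# Sketch: print a health summary for every NVSM pod in the namespace.
# Assumes `nvsm show health` can run non-interactively via `oc exec`.
for pod in $(oc get pods -n nvidia-nvsm -o jsonpath='{.items[*].metadata.name}'); do
  echo "=== $pod ==="
  oc exec "$pod" -n nvidia-nvsm -- nvsm show health
done
```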
Retrieving Health Information#
This section describes the steps to generate and retrieve health information to debug a system issue offline or when requested by the NVIDIA Enterprise Support organization.
List all NVSM pod instances and the corresponding worker nodes to find the pod name that is associated with a DGX system.
oc get pods -n nvidia-nvsm -o wide

NAME                READY   STATUS    RESTARTS   AGE   IP            NODE       ...
nvidia-nvsm-d9d9t   1/1     Running   1          8h    10.128.2.11   worker-0   ...
nvidia-nvsm-tt8g5   1/1     Running   1          8h    10.131.0.11   worker-1   ...
Start an interactive shell in the NVSM pod of the corresponding DGX worker node.
In the following command, <pod-name> is the name of the pod from the list in step 1.
oc exec -it <pod-name> -n nvidia-nvsm -- /bin/bash
Here is an example:
oc exec -it nvidia-nvsm-d9d9t -n nvidia-nvsm -- /bin/bash
Create the NVSM snapshot file of all system components for offline analysis and diagnostics.
The file is created in the /tmp directory in the container.

nvsm dump health
Unable to find NVML library, Aborting.
Health dump started
This command will collect system configuration and diagnostic
information to help diagnose the system.

The output may contain data considered sensitive and should
be reviewed before sending to any third party.

Collecting 100% |████████████████████████████████████████|
The output of this command is written to:
/tmp/nvsm-health-nvidia-nvsm-d9d9t-20211211170039.tar.xz
Exit the container.
exit
Copy the snapshot file out of the container to the local host for further analysis or to send it to NVIDIA Enterprise Support.
oc cp nvidia-nvsm/<pod-name>:tmp/<snapshot-file> <target-file>
<snapshot-file> refers to the name of the generated file from step 3, and <target-file> refers to the name of the file on the local host. Replace these placeholders with actual names.
Delete the generated snapshot file in the NVSM pod.
oc exec -it <pod-name> -n nvidia-nvsm -- rm /tmp/<snapshot-file>
The generated file can now be used to debug and analyze system issues, or you can send the file to NVIDIA Enterprise Support.
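The generate, copy, and clean-up steps above can be sketched as one small script. The pod and snapshot names are the hypothetical examples from this section, and the script only prints each command (a dry run) so you can review them before executing against a live cluster; it assumes nvsm can be invoked non-interactively via `oc exec`.

```shell
#!/bin/sh
# Dry-run sketch of the snapshot workflow: generate, copy out, delete.
# Pod and snapshot names are placeholders taken from the example output above.
NS=nvidia-nvsm
POD=nvidia-nvsm-d9d9t
SNAPSHOT=nvsm-health-${POD}-20211211170039.tar.xz

# Each command is printed instead of executed; remove `echo` on a real cluster.
echo "oc exec $POD -n $NS -- nvsm dump health"
echo "oc cp $NS/$POD:tmp/$SNAPSHOT ./$SNAPSHOT"
echo "oc exec $POD -n $NS -- rm /tmp/$SNAPSHOT"
```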