Using NVSM

For a maintenance task, or to perform a health analysis, you can now run NVSM remotely inside the containers deployed on the DGX worker nodes.

Here is the general NVSM workflow:

  1. List all NVSM pod instances and the corresponding worker nodes to find the pod name associated with a specific DGX.

    oc get pods -n nvidia-nvsm -o wide
    NAME                READY   STATUS    RESTARTS   AGE   IP           NODE     ...
    nvidia-nvsm-d9d9t   1/1     Running   1          8h    10.128.2.11  worker-0 ...
    nvidia-nvsm-tt8g5   1/1     Running   1          8h    10.131.0.11  worker-1 ...
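
    If you already know which worker node you want to inspect, you can narrow the listing to that node with a standard Kubernetes field selector. The node name worker-0 below is taken from the example output above:

    oc get pods -n nvidia-nvsm -o wide --field-selector spec.nodeName=worker-0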
    
  2. Use the oc exec command to start an interactive shell in the container that is running on that system.

    oc exec -it <pod-name> -n nvidia-nvsm -- /bin/bash
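
    For example, using the pod name that corresponds to worker-0 in the listing above:

    oc exec -it nvidia-nvsm-d9d9t -n nvidia-nvsm -- /bin/bash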
    
  3. You can now use one of the following main NVSM commands.

    Note

    When you execute NVSM, it can take a couple of minutes to collect system information.

    • To print the software and firmware versions of the DGX system:

      nvsm show version
      
    • To provide a summary of the system health:

      nvsm show health
      
    • To create a snapshot of the system components for offline analysis and diagnosis:

      nvsm dump health
      

      This command generates a tar file in the /tmp directory (see Retrieving Health Information for more information).

  4. Exit the interactive shell:

    exit
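
If you only need the output of a single NVSM command, you can also pass it to oc exec directly instead of opening an interactive shell. This is a convenience sketch that assumes the pod name from the example listing in step 1:

    oc exec nvidia-nvsm-d9d9t -n nvidia-nvsm -- nvsm show health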
    

Retrieving Health Information

This section describes the steps to generate and retrieve health information to debug a system issue offline or when requested by the NVIDIA Enterprise Support organization.

  1. List all NVSM pod instances and the corresponding worker nodes to find the pod name that is associated with a DGX system.

    oc get pods -n nvidia-nvsm -o wide
    
    NAME              READY STATUS  RESTARTS  AGE  IP          NODE     ...
    nvidia-nvsm-d9d9t 1/1   Running 1         8h   10.128.2.11 worker-0 ...
    nvidia-nvsm-tt8g5 1/1   Running 1         8h   10.131.0.11 worker-1 ...
    
  2. Start an interactive shell in the NVSM pod of the corresponding DGX worker node.

    <pod-name> is the name of the pod from the list in step 1.

    oc exec -it <pod-name> -n nvidia-nvsm -- /bin/bash
    

    Here is an example:

    oc exec -it nvidia-nvsm-d9d9t -n nvidia-nvsm -- /bin/bash
    
  3. Create the NVSM snapshot file of all system components for offline analysis and diagnostics.

    The file is created in the /tmp directory in the container.

    nvsm dump health
    
    Unable to find NVML library, Aborting.
    
    Health dump started
      This command will collect system configuration and diagnostic information to help diagnose the system.
      The output may contain data considered sensitive and should be reviewed before sending to any third party.
    
    Collecting 100% |████████████████████████████████████████|
    
    The output of this command is written to: /tmp/nvsm-health-nvidia-nvsm-d9d9t-20211211170039.tar.xz
    
  4. Exit the container.

    exit
    
  5. Copy the snapshot file out of the container to the local host for further analysis or to send it to NVIDIA Enterprise Support.

    oc cp nvidia-nvsm/<pod-name>:tmp/<snapshot-file> <target-file>
    

    <snapshot-file> refers to the name of the file generated in step 3, and <target-file> refers to the name of the file on the local host. Replace these placeholders with the actual names.
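
    For example, using the pod and snapshot file names from the previous steps:

    oc cp nvidia-nvsm/nvidia-nvsm-d9d9t:tmp/nvsm-health-nvidia-nvsm-d9d9t-20211211170039.tar.xz nvsm-health-nvidia-nvsm-d9d9t-20211211170039.tar.xz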

  6. Delete the generated snapshot file in the NVSM pod.

    oc exec -it <pod-name> -n nvidia-nvsm -- rm /tmp/<snapshot-file>
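
    For example, to remove the snapshot file created earlier:

    oc exec -it nvidia-nvsm-d9d9t -n nvidia-nvsm -- rm /tmp/nvsm-health-nvidia-nvsm-d9d9t-20211211170039.tar.xz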
    

The generated file can now be used to debug and analyze system issues, or you can send the file to NVIDIA Enterprise Support.
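
The individual steps above can also be combined into a small shell script. The following is a minimal sketch, assuming the nvidia-nvsm namespace and pod naming shown above, that the snapshot is the newest nvsm-health-*.tar.xz file in the pod's /tmp directory, and that nvsm dump health can be run non-interactively; adjust the node name for your environment.

    # Minimal sketch: collect and retrieve an NVSM health snapshot from one worker node.
    # Assumptions: namespace nvidia-nvsm, node name worker-0, and that the snapshot is
    # the newest nvsm-health-*.tar.xz file in the pod's /tmp directory.
    NODE=worker-0
    NS=nvidia-nvsm

    # Find the NVSM pod that runs on the chosen worker node.
    POD=$(oc get pods -n "$NS" --field-selector spec.nodeName="$NODE" \
          -o jsonpath='{.items[0].metadata.name}')

    # Create the snapshot inside the pod (this can take a couple of minutes).
    oc exec "$POD" -n "$NS" -- nvsm dump health

    # Determine the generated file name, copy it to the local host, then clean up.
    FILE=$(oc exec "$POD" -n "$NS" -- sh -c 'ls -t /tmp/nvsm-health-*.tar.xz | head -n 1')
    oc cp "$NS/$POD:${FILE#/}" "$(basename "$FILE")"
    oc exec "$POD" -n "$NS" -- rm "$FILE"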