Update Individual Software Packages With BCM#

DGX BasePOD and SuperPOD administrators can update individual software packages or components to address specific dependency requirements. The following are the high-level steps applicable to the packages in this section:

  1. Update the package within the DGX OS image on the headnode.

  2. Verify the update(s) on one of the DGX nodes.

  3. Apply the updated DGX OS image to all the DGX nodes using the imageupdate command within the Cluster Management Shell (cmsh).

Note

An exception is GPU driver updates, which require a reboot.

CUDA Toolkit#

  1. Chroot to the DGX OS image used by the DGX node category. A best practice is to save a copy of the image in case you need to roll back to a prior DGX OS release, version, or state; you can save a copy by using the clone image function within cmsh, as shown below.

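    # Optional: clone the image first so that you can roll back later (the backup image name is illustrative).
    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% softwareimage
    [demeter-headnode-01->softwareimage]% clone dgx-os-6.3.2-h100-image dgx-os-6.3.2-h100-image-backup
    [demeter-headnode-01->softwareimage*[dgx-os-6.3.2-h100-image-backup*]]% commit
    [demeter-headnode-01->softwareimage[dgx-os-6.3.2-h100-image-backup]]% quit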
    root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image
    
  2. Run apt update to refresh the local package repository metadata (list of available upgradable packages and their versions).

    root@dgx-os-6:/# apt update
    
  3. Install the latest supported CUDA Toolkit for the DGX OS version running on the DGX node; refer to the release notes for the supported version(s). The following example updates the CUDA Toolkit to 12.4.

    root@dgx-os-6:/# apt install cuda-toolkit-12-4
    
  4. Verify that the CUDA toolkit is now installed and then exit chroot.

    root@dgx-os-6:/# apt list --installed cuda-toolkit-12-4
    
  5. Update one of the DGX nodes to the updated DGX OS image and verify the update. Exit cmsh only after you see the “Provisioning completed” message.

    Note

    A reboot may be required if the applied image has a different release and/or kernel version than the DGX node.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% use dgx-01
    [demeter-headnode-01->device[dgx-01]]% imageupdate -w
    
  6. SSH to the DGX node.

    root@demeter-headnode-01:~# ssh dgx-01
    
  7. Check the CUDA compiler version.

    root@dgx-01:~# nvcc --version
    
  8. Use the nvidia-smi command to display information about the installed GPUs, driver version, and CUDA version, confirming that the system recognizes the GPUs and that the driver is functioning.

    root@dgx-01:~# nvidia-smi
    Mon Jun  2 23:24:46 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
    | N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
    | N/A   29C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA H100 80GB HBM3          On  |   00000000:52:00.0 Off |                    0 |
    | N/A   30C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
    | N/A   31C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   4  NVIDIA H100 80GB HBM3          On  |   00000000:9D:00.0 Off |                    0 |
    | N/A   31C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   5  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
    | N/A   29C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   6  NVIDIA H100 80GB HBM3          On  |   00000000:D1:00.0 Off |                    0 |
    | N/A   31C    P0             73W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   7  NVIDIA H100 80GB HBM3          On  |   00000000:DF:00.0 Off |                    0 |
    | N/A   34C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    
  9. To verify CUDA functionality, download the CUDA samples from NVIDIA’s GitHub repository, then build and run a sample to confirm that CUDA is working properly.

    root@dgx-01:~# git clone https://github.com/nvidia/cuda-samples.git
    
  10. Navigate to one of the CUDA samples directories, such as deviceQuery.

    root@dgx-01:~# cd cuda-samples/Samples/1_Utilities/deviceQuery
    
  11. Build the testing sample using CMake, then run the sample.

    root@dgx-01:~/cuda-samples/Samples/1_Utilities/deviceQuery# mkdir build && cd build
    root@dgx-01:~/cuda-samples/Samples/1_Utilities/deviceQuery/build# cmake ..
    root@dgx-01:~/cuda-samples/Samples/1_Utilities/deviceQuery/build# make -j$(nproc)
    root@dgx-01:~/cuda-samples/Samples/1_Utilities/deviceQuery/build# ./deviceQuery
    ./deviceQuery Starting...
    
    CUDA Device Query (Runtime API) version (CUDART static linking)
    
    Detected 8 CUDA Capable device(s)
    
    Device 0: "NVIDIA H100 80GB HBM3"
    CUDA Driver Version / Runtime Version          12.4 / 12.4
    CUDA Capability Major/Minor version number:    9.0
    .
    [output truncated]
    .
    > Peer access from NVIDIA H100 80GB HBM3 (GPU7) -> NVIDIA H100 80GB HBM3 (GPU3) : Yes
    > Peer access from NVIDIA H100 80GB HBM3 (GPU7) -> NVIDIA H100 80GB HBM3 (GPU4) : Yes
    > Peer access from NVIDIA H100 80GB HBM3 (GPU7) -> NVIDIA H100 80GB HBM3 (GPU5) : Yes
    > Peer access from NVIDIA H100 80GB HBM3 (GPU7) -> NVIDIA H100 80GB HBM3 (GPU6) : Yes
    
    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.4, CUDA Runtime Version = 12.4, NumDevs = 8
    Result = PASS
    
  12. After verifying CUDA is working properly, apply the updated DGX OS image to the remaining DGX nodes (this assumes that the updated image is being used by the DGX nodes category). Exit cmsh only after you see the “Provisioning completed” message.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% imageupdate -w -c dgx-h100
    

DCGM#

If updating from a DGX OS release earlier than 6.3.2, you must manually upgrade the datacenter-gpu-manager package from version 3.x to version 4.x. Refer to the instructions in the Installation section of the DCGM documentation. The steps below are best practices when using a chroot environment.

  1. Before updating DCGM, make sure the existing DCGM system service (nvidia-dcgm) is stopped on the DGX nodes.

    root@demeter-headnode-01:~# pdsh -w dgx-[01-31] systemctl stop nvidia-dcgm
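    # Optionally confirm that the service is no longer active on the nodes:
    root@demeter-headnode-01:~# pdsh -w dgx-[01-31] systemctl is-active nvidia-dcgm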
    
  2. Chroot to the DGX OS image on the headnode that is being used by the DGX nodes. It is always a best practice to save a copy of the image in case you need to roll back to a prior DGX OS release, version, or state; you can save a copy by using the clone image function within cmsh (see the example in the CUDA Toolkit section).

    root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image
    
  3. On the image, check the current DCGM version.

    root@dgx-os-6:/# dcgmi --version
    
    dcgmi  version: 3.1.8
    
  4. Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.

    root@dgx-os-6:/# dpkg --list datacenter-gpu-manager &> /dev/null && apt purge --yes datacenter-gpu-manager
    root@dgx-os-6:/# dpkg --list datacenter-gpu-manager-config &> /dev/null && apt purge --yes datacenter-gpu-manager-config
    
  5. Update the local package repository metadata.

    root@dgx-os-6:/# apt update
    
  6. Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version. You can verify the CUDA version installed in the cloned image by issuing the following command; in this case, the CUDA version is 12.

    root@dgx-os-6:/# ls /usr/local/ | grep cuda
    root@dgx-os-6:/# apt install --yes --install-recommends datacenter-gpu-manager-4-cuda12
    
  7. Verify that the version of DCGMI is updated and then exit chroot.

    root@dgx-os-6:/# dcgmi --version
    
    dcgmi  version: 4.2.3
    
  8. Apply the updated DGX OS image to one of the DGX nodes to validate DCGM functionality. Exit cmsh only after you see the “Provisioning completed” message.

    Note

    A reboot may be required if the applied image has a different release and/or kernel version than the DGX node.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% use dgx-01
    [demeter-headnode-01->device[dgx-01]]% imageupdate -w
    
  9. SSH to the DGX node. Verify DCGM is active.

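    root@demeter-headnode-01:~# ssh dgx-01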
    root@dgx-01:~# systemctl status nvidia-dcgm
    ● nvidia-dcgm.service - NVIDIA DCGM service
    Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
    Active: active (running) since Tue 2025-06-03 21:45:34 PDT; 39min ago
    Main PID: 82283 (nv-hostengine)
    Tasks: 8 (limit: 629145)
    Memory: 65.4M
            CPU: 22min 25.362s
    CGroup: /system.slice/nvidia-dcgm.service
            └─82283 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
    
    Jun 03 21:45:34 dgx-01 systemd[1]: Started NVIDIA DCGM service.
    Jun 03 21:45:36 dgx-01 nv-hostengine[82283]: DCGM initialized
    Jun 03 21:45:36 dgx-01 nv-hostengine[82283]: Started host engine version 4.2.3 using port number: 5555
    
  10. To verify DCGM functionality, use dcgmi to query the system. You should see a listing of all supported GPUs (and any NVSwitches) found in the system.

    root@dgx-01:~# dcgmi discovery -l
    8 GPUs found.
    +--------+----------------------------------------------------------------------+
    | GPU ID | Device Information                                                   |
    +--------+----------------------------------------------------------------------+
    | 0      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:1B:00.0                                         |
    |        | Device UUID: GPU-1c982352-da78-c318-7424-27271347284e                |
    +--------+----------------------------------------------------------------------+
    | 1      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:43:00.0                                         |
    |        | Device UUID: GPU-4247ca58-0e26-a18a-780e-5a01bffb8630                |
    +--------+----------------------------------------------------------------------+
    | 2      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:52:00.0                                         |
    |        | Device UUID: GPU-5adb0f97-f5aa-c51d-7e02-6139a9a62f7f                |
    +--------+----------------------------------------------------------------------+
    | 3      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:61:00.0                                         |
    |        | Device UUID: GPU-1da404d8-6973-4e78-4f8e-fa9334193c6c                |
    +--------+----------------------------------------------------------------------+
    | 4      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:9D:00.0                                         |
    |        | Device UUID: GPU-3505b245-a831-c969-83e3-15f53ba5c109                |
    +--------+----------------------------------------------------------------------+
    | 5      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:C3:00.0                                         |
    |        | Device UUID: GPU-f91c8a55-9f7c-e9b8-f4cc-ea402d8d2fc8                |
    +--------+----------------------------------------------------------------------+
    | 6      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:D1:00.0                                         |
    |        | Device UUID: GPU-7293efb3-9d26-53dd-cee2-5c2f10426b70                |
    +--------+----------------------------------------------------------------------+
    | 7      | Name: NVIDIA H100 80GB HBM3                                          |
    |        | PCI Bus ID: 00000000:DF:00.0                                         |
    |        | Device UUID: GPU-e4e0bb86-436a-c346-a73d-be11539c0d34                |
    +--------+----------------------------------------------------------------------+
    4 NvSwitches found.
    +-----------+
    | Switch ID |
    +-----------+
    | 0         |
    | 2         |
    | 3         |
    | 1         |
    +-----------+
    0 ConnectX found.
    +----------+
    | ConnectX |
    +----------+
    +----------+
    0 CPUs found.
    +--------+----------------------------------------------------------------------+
    | CPU ID | Device Information                                                   |
    +--------+----------------------------------------------------------------------+
    +--------+----------------------------------------------------------------------+
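    # Optionally, run a short DCGM diagnostic (level 1) to further validate functionality:
    root@dgx-01:~# dcgmi diag -r 1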
    
  11. After verifying DCGM is working properly, exit from the testing DGX node. On the headnode, use cmsh to apply the updated DGX OS image to the remaining DGX nodes (this assumes that the updated image is being used by the DGX nodes category). Exit cmsh only after you see the “Provisioning completed” message.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% imageupdate -w -c dgx-h100
    

Enroot#

  1. The enroot and enroot+caps packages have been part of the BCM software image since release 10.23.10.

  2. You can update enroot on the headnodes by using apt.

    root@demeter-headnode-01:~# apt update
    root@demeter-headnode-01:~# apt install enroot enroot+caps
    
  3. There are two methods to update the enroot environment in the DGX OS image: using apt with the BCM repository (option 1 below) or obtaining the packages directly from the NVIDIA repository (option 2 below).

    1. To update enroot for the DGX nodes, chroot to the DGX OS image on the headnode and use apt to update enroot. Exit chroot when enroot is updated.

      root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image/
      root@dgx-os-6:/# enroot version
      3.4.1
      root@dgx-os-6:/# apt update
      root@dgx-os-6:/# apt list --upgradable | grep enroot*
      
      WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
      
      enroot+caps/BCM 10.0 3.5.0-100008-cm10.0-07e3dbc1dd amd64 [upgradable from: 3.4.1-100005-cm10.0-dde153f138]
      enroot/BCM 10.0 3.5.0-100008-cm10.0-07e3dbc1dd amd64 [upgradable from: 3.4.1-100005-cm10.0-dde153f138]
      root@dgx-os-6:/# apt install enroot enroot+caps
      root@dgx-os-6:/# enroot version
      3.5.0
      
    2. To update enroot from the NVIDIA repository, begin by downloading the preferred enroot version (see the example below), copy the packages into the image, and install them from within the chroot.

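      # Example download of the enroot release packages; the URL and version are illustrative,
      # so adjust them to your preferred source and release.
      root@demeter-headnode-01:~# wget https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot_3.5.0-1_amd64.deb
      root@demeter-headnode-01:~# wget https://github.com/NVIDIA/enroot/releases/download/v3.5.0/enroot+caps_3.5.0-1_amd64.deb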
      root@demeter-headnode-01:~# cp ./enroot_3.5.0-1_amd64.deb /cm/images/dgx-os-6.3.2-h100-image
      root@demeter-headnode-01:~# cp ./enroot+caps_3.5.0-1_amd64.deb /cm/images/dgx-os-6.3.2-h100-image
      root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image/
      root@dgx-os-6:/# apt install ./enroot+caps_3.5.0-1_amd64.deb ./enroot_3.5.0-1_amd64.deb
      
  4. Apply the updated DGX OS image to the DGX nodes within cmsh. Exit cmsh only after you see the “Provisioning completed” message.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% imageupdate -w -c dgx-h100
    

GPU Driver#

  1. Identify the validated GPU driver branch and version for your DGX architecture and DGX OS release from the DGX SuperPOD Release Notes. In this example, we update (and change) the GPU driver from branch 535 to branch 550. More information about changing the GPU driver branch can be found in the DGX OS documentation. For a GPU driver update within the same branch, an apt update followed by an upgrade of the driver package is sufficient; see the example below.

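    # If you are staying on the same driver branch, an in-place upgrade inside the image chroot
    # (entered in the next step) is typically sufficient; the branch number shown is illustrative.
    root@dgx-os-6:/# apt update && apt install --only-upgrade -y nvidia-driver-535-server
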
  2. Chroot into the DGX OS image being used by the DGX nodes on the headnode. Verify the installed GPU driver branch and version.

    root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image/
    root@dgx-os-6:/# apt list --installed nvidia-driver*server
    Listing... Done
    nvidia-driver-535-server/jammy-updates,jammy-security,now 535.247.01-0ubuntu0.22.04.1 amd64 [installed]
    
  3. Use ‘apt-mark unhold’ to unhold any packages that should be automatically updated, such as the linux kernel and headers.

    Note

    If the MLNX OFED packages were deployed using the BCM repository, these will be on hold and prevent kernel updates.

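    # Optionally, list the currently held packages first:
    root@dgx-os-6:/# apt-mark showhold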
    root@dgx-os-6:/# apt-mark unhold linux-*
    
  4. Install the latest DGX kernel version. If a prompt appears stating that a newer kernel is available, select <OK> to continue.

    root@dgx-os-6:/# apt install -y linux-nvidia
    
  5. Update the local package database.

    root@dgx-os-6:/# apt update
    root@dgx-os-6:/# apt list nvidia-driver*server
    Listing... Done
    nvidia-driver-418-server/jammy-updates,jammy-security 418.226.00-0ubuntu5~0.22.04.1 amd64
    nvidia-driver-440-server/jammy-updates,jammy-security 450.248.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-450-server/jammy-updates,jammy-security 450.248.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-460-server/jammy-updates,jammy-security 470.256.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-470-server/jammy-updates,jammy-security 470.256.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-510-server/jammy-updates,jammy-security 515.105.01-0ubuntu0.22.04.1 amd64
    nvidia-driver-515-server/jammy-updates,jammy-security 525.147.05-0ubuntu2.22.04.1 amd64
    nvidia-driver-525-server/jammy-updates,jammy-security 525.147.05-0ubuntu2.22.04.1 amd64
    nvidia-driver-535-server/jammy-updates,jammy-security,now 535.247.01-0ubuntu0.22.04.1 amd64 [installed]
    nvidia-driver-550-server/jammy-updates,jammy-security 550.163.01-0ubuntu0.22.04.1 amd64
    nvidia-driver-565-server/jammy-updates 565.57.01-0ubuntu0.22.04.4 amd64
    nvidia-driver-570-server/jammy-updates,jammy-security 570.133.20-0ubuntu0.22.04.1 amd64
    
  6. First check the package installation (with the --dry-run option) and then install the NVIDIA GPU driver (without the --dry-run option). Replace the release version used as an example (550) with the release you want to install.

    root@dgx-os-6:/# apt install -y nvidia-driver-550-server linux-modules-nvidia-550-server-nvidia libnvidia-nscq-550 nvidia-modprobe nvidia-fabricmanager-550 nv-persistence-mode --dry-run
    root@dgx-os-6:/# apt install -y nvidia-driver-550-server linux-modules-nvidia-550-server-nvidia libnvidia-nscq-550 nvidia-modprobe nvidia-fabricmanager-550 nv-persistence-mode
    
  7. Verify the GPU driver branch and version installed and then exit chroot.

    root@dgx-os-6:/# apt list --installed | grep nvidia-driver
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    nvidia-driver-550-server/jammy-updates,jammy-security,now 550.163.01-0ubuntu0.22.04.1 amd64 [installed]
    
  8. Select the updated DGX kernel to be used for the updated DGX OS image in cmsh. Wait until you see the “Initial ramdisk for image dgx-os-6.3.2-h100-image was generated successfully” message before exiting cmsh.

    Note

    Type set kernelversion and then press the Tab key twice to use tab completion to select the updated version.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% softwareimage
    [demeter-headnode-01->softwareimage]% use dgx-os-6.3.2-h100-image
    [demeter-headnode-01->softwareimage[dgx-os-6.3.2-h100-image]]% set kernelversion 5.15.0-1078-nvidia
    [demeter-headnode-01->softwareimage*[dgx-os-6.3.2-h100-image*]]% commit
    Sun Jun  1 22:15:28 2025 [notice] demeter-headnode-01: Initial ramdisk for image dgx-os-6.3.2-h100-image is being generated
    Sun Jun  1 22:16:15 2025 [notice] demeter-headnode-01: Initial ramdisk for image dgx-os-6.3.2-h100-image was generated successfully
    
  9. Apply the updated DGX OS image to one of the DGX nodes by rebooting it; a reboot is required because the kernel has been updated.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% use dgx-01
    [demeter-headnode-01->device[dgx-01]]% reboot
    
  10. Once the DGX node has rebooted, SSH to it and run nvidia-smi to verify the installed GPU driver branch and version.

    root@demeter-headnode-01:~# ssh dgx-01
    root@dgx-01:~# nvidia-smi
    Wed Jun  4 23:47:57 2025
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
    |-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA H100 80GB HBM3          On  |   00000000:1B:00.0 Off |                    0 |
    | N/A   28C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
    | N/A   28C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   2  NVIDIA H100 80GB HBM3          On  |   00000000:52:00.0 Off |                    0 |
    | N/A   31C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   3  NVIDIA H100 80GB HBM3          On  |   00000000:61:00.0 Off |                    0 |
    | N/A   31C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   4  NVIDIA H100 80GB HBM3          On  |   00000000:9D:00.0 Off |                    0 |
    | N/A   29C    P0             71W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   5  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
    | N/A   27C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   6  NVIDIA H100 80GB HBM3          On  |   00000000:D1:00.0 Off |                    0 |
    | N/A   30C    P0             72W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    |   7  NVIDIA H100 80GB HBM3          On  |   00000000:DF:00.0 Off |                    0 |
    | N/A   33C    P0             70W /  700W |       1MiB /  81559MiB |      0%      Default |
    |                                         |                        |             Disabled |
    +-----------------------------------------+------------------------+----------------------+
    
    +-----------------------------------------------------------------------------------------+
    | Processes:                                                                              |
    |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
    |        ID   ID                                                               Usage      |
    |=========================================================================================|
    |  No running processes found                                                             |
    +-----------------------------------------------------------------------------------------+
    
  11. Log out from the DGX node. Reboot the remaining DGX nodes within cmsh (this assumes that all the DGX nodes are set to use the updated DGX OS image).

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% reboot -c dgx-h100
    

MOFED (Mellanox OFED) Managed by BCM – (Preferred Method)#

NVIDIA recommends that DGX BasePOD and SuperPOD customers update to the latest MOFED drivers that are managed through the BCM repository by using the following steps. Also consider the transition to DOCA OFED outlined in the DOCA OFED Transition section.

Update steps via BCM: (DGX OS 6)

  1. Install the desired MOFED package from the BCM repository.

    apt update && apt install mlnx-ofed24.10 -y
    
  2. Check whether the desired kernel version is selected on the image that will have the MLNX package installed, and set it if it is not already set. The package will build the kernel modules against this version.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% softwareimage
    [demeter-headnode-01->softwareimage]% use dgx-os-6.3.2-h100-image
    [demeter-headnode-01->softwareimage[dgx-os-6.3.2-h100-image]]% get kernelversion
    5.15.0-1046-nvidia
    [demeter-headnode-01->softwareimage[dgx-os-6.3.2-h100-image]]% set kernelversion 5.15.0-1078-nvidia
    [demeter-headnode-01->softwareimage*[dgx-os-6.3.2-h100-image*]]% commit
    Sun Jun  1 22:15:28 2025 [notice] demeter-headnode-01: Initial ramdisk for image dgx-os-6.3.2-h100-image is being generated
    Sun Jun  1 22:16:15 2025 [notice] demeter-headnode-01: Initial ramdisk for image dgx-os-6.3.2-h100-image was generated successfully
    [demeter-headnode-01->softwareimage[dgx-os-6.3.2-h100-image]]% quit
    
  3. Install the MLNX Package onto the image.

    root@demeter-headnode-01:~# /cm/local/apps/mlnx-ofed24.10/current/bin/mlnx-ofed24.10-install.sh -s dgx-os-6.3.2-h100-image
    
    Mellanox OFED installation, version: 24.10-2.1.8.0 for x86_64.
    On " dgx-os-6.3.2-h100-image " software image, for kernel version: 5.15.0-1078-nvidia.
    Log file: /var/log/cm-ofed.log
    
    Package directory: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64
    
    removing: dapl2-utils ibacm ibsim-utils ibutils ibverbs-providers ibverbs-providers:amd64 ibverbs-utils infiniband-diags libdapl2 libibdm1 libibmad5 libibmad5:amd64 libibnetdisc5 libibnetdisc5:amd64 libibumad3 libibverbs1 libipathverbs1 libmlx4-1 libmlx5-1 libmthca1 libopensm2 libopensm9 libosmcomp5 libosmvendor5 librdmacm1 librdmacm1:amd64 libumad2sim0 mstflint opensm openvswitch-switch perftest rdmacm-utils rdma-core srptools
    purging:  opensm infiniband-diags infiniband-diags srptools libosmvendor5 infiniband-diags ibacm srptools libdapl2 ibverbs-providers:amd64 ibacm opensm libosmvendor5 opensm libosmvendor5 libopensm9 opensm libdapl2 srptools ibacm
    purging: rdma-core
    
    Package directory: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/ofed-scripts_24.10.OFED.24.10.2.1.8-1_amd64.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-tools_24.10-0.2410068_amd64.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-ofed-kernel-utils_24.10.OFED.24.10.2.1.8.1-1_amd64.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-ofed-kernel-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/iser-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/isert-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/srp-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-nfsrdma-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-nvme-dkms_24.10.OFED.24.10.2.1.8.1-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/kernel-mft-dkms_4.30.1.113-1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/knem-dkms_1.1.4.90mlnx3-OFED.23.10.0.2.1.1_all.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/xpmem-dkms_2.7.4-1.2410068_all.deb
    installing:  rdma-core:amd64 libibverbs1:amd64 ibverbs-utils:amd64 ibverbs-providers:amd64 libibverbs-dev:amd64 libibverbs1-dbg:amd64 libibumad3:amd64 libibumad-dev:amd64 ibacm:amd64 librdmacm1:amd64 rdmacm-utils:amd64 librdmacm-dev:amd64 ibdump:amd64 libibmad5:amd64 libibmad-dev:amd64 libopensm:amd64 opensm:amd64 opensm-doc:amd64 libopensm-devel:amd64 libibnetdisc5:amd64 infiniband-diags:amd64 mft:amd64 perftest:amd64 ibutils2:amd64 ibsim:amd64 ibsim-doc:all ucx:amd64 sharp:amd64 hcoll:amd64 knem:amd64 openmpi:all mpitests:amd64 xpmem:all libxpmem0:amd64 libxpmem-dev:amd64 dpcp:amd64 srptools:amd64 mlnx-ethtool:amd64 mlnx-iproute2:amd64 rshim:amd64 ibarr:amd64
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-fw-updater_24.10-2.1.8.0_amd64.deb
    installing: /cm/local/apps/mlnx-ofed24.10/24.10-2.1.8.0-ubuntu22.04/DEBS/x86_64/mlnx-ofed24.10-modules_24.10.2.1.8.0-100005-cm10.0-574d108822_all.deb
    Update kernel module dependencies.
    Enable openibd service.
    marking package "linux-generic" as held back
    marking package "linux-headers-generic" as held back
    marking package "linux-image-generic" as held back
    
    Creating ramdisk image.
    Installed Mellanox OFED stack DEB packages on " dgx-os-6.3.2-h100-image " software image.
    
    Done.
    root@demeter-headnode-01:~#
    
  4. Check the installation from the image.

    root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image
    root@dgx-os-6.3.2-h100-image:/# ofed_info -s
    MLNX_OFED_LINUX-24.10-2.1.8.0:
    root@dgx-os-6.3.2-h100-image:/# apt list --installed | grep -i  ofed
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    iser-dkms/now 24.10.OFED.24.10.2.1.8.1-1 all [installed,local]
    isert-dkms/now 24.10.OFED.24.10.2.1.8.1-1 all [installed,local]
    knem-dkms/now 1.1.4.90mlnx3-OFED.23.10.0.2.1.1 all [installed,local]
    knem/now 1.1.4.90mlnx3-OFED.23.10.0.2.1.1 amd64 [installed,local]
    mlnx-nfsrdma-dkms/now 24.10.OFED.24.10.2.1.8.1-1 all [installed,local]
    mlnx-nvme-dkms/now 24.10.OFED.24.10.2.1.8.1-1 all [installed,local]
    mlnx-ofed-kernel-dkms/now 24.10.OFED.24.10.2.1.8.1-1 all [installed,local]
    mlnx-ofed-kernel-utils/now 24.10.OFED.24.10.2.1.8.1-1 amd64 [installed,local]
    mlnx-ofed24.10-modules/now 24.10.2.1.8.0-100005-cm10.0-574d108822 all [installed,local]
    ofed-scripts/now 24.10.OFED.24.10.2.1.8-1 amd64 [installed,local]
    srp-dkms/now 24.10.OFED.24.10.2.1.8.1-1 all [installed,local]
    root@dgx-os-6.3.2-h100-image:/# dkms status | grep ofed
    mlnx-ofed-kernel/24.10.OFED.24.10.2.1.8.1, 5.15.0-1078-nvidia, x86_64: installed
    root@dgx-os-6.3.2-h100-image:/# exit
    root@demeter-headnode-01:~#
    
  5. Reboot one of the DGX nodes to check that MLNX OFED updated properly and starts correctly.

    root@demeter-headnode-01:~# pdsh -w dgx-01 reboot
    dgx-01: Connection to dgx-01 closed by remote host.
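    # Once the node is back up, confirm the installed OFED version before proceeding, for example:
    root@demeter-headnode-01:~# ssh dgx-01 ofed_info -s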
    
  6. After verifying that the DGX node booted with the updated DGX OS image, proceed to reboot the remaining DGX nodes.

    root@demeter-headnode-01:~# pdsh -w dgx-[02-31] reboot
    

MOFED (Mellanox OFED) Not Managed by BCM#

For MOFED drivers that are deployed directly on the images rather than managed through the BCM repository, NVIDIA recommends that DGX BasePOD and SuperPOD customers update to the latest MOFED drivers using the following steps. Consider the transition to DOCA OFED outlined in the next section.

Update steps via chroot: (DGX OS 6)

  1. Download the newer package from the Mellanox repository (https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/), for example MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz.

  2. Copy MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz to the headnode, for example as shown below.

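    # Example copy from an administration workstation; the source path and hostname are illustrative.
    scp MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz root@demeter-headnode-01:/root/
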
  3. Check which software image is being used by the DGX nodes. In this case, the image used by the DGX nodes is ‘dgx-os-6.1-h100-image’, and the number in the “Nodes” column indicates how many nodes are using it. Note the kernel version, which will be used in a later step.

    root@demeter-headnode-01:~# cmsh -c 'softwareimage list'
    Name (key)             Path (key)                               Kernel version      Nodes
    ---------------------- ---------------------------------------- ------------------- --------
    default-image          /cm/images/default-image                 5.19.0-45-generic   0
    dgx-os-6.1-a100-image  /cm/images/dgx-os-6.1-a100-image         5.15.0-1042-nvidia  0
    dgx-os-6.1-h100-image  /cm/images/dgx-os-6.1-h100-image         5.15.0-1042-nvidia  31
    k8s-image              /cm/images/k8s-image                     5.19.0-45-generic   0
    
  4. Copy the file to the image directory.

    cp MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz /cm/images/dgx-os-6.1-h100-image/tmp/
    
  5. On the headnode, chroot to the target DGX OS image.

    cm-chroot-sw-img /cm/images/dgx-os-6.1-h100-image
    
  6. Extract the files:

    cd /tmp
    tar -xzvf  MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64.tgz
    
    # Change into the new directory
    cd MLNX_OFED_LINUX-23.10-0.5.5.0-ubuntu22.04-x86_64
    
  7. Uninstall the existing version.

    ./uninstall.sh
    
  8. Install the new version specifying the kernel version used by the DGX nodes determined earlier.

    ./mlnxofedinstall --without-dkms --add-kernel-support --kernel 5.15.0-1042-nvidia --without-fw-update --force
    

    If an error for ucx-cuda is encountered, do the following additional steps.

    # Clean up failed packages
    root@dgx-os-6:/# apt --fix-broken install
    
    # Install MOFED without the ucx-cuda package
    root@dgx-os-6:/# ./mlnxofedinstall --without-dkms --add-kernel-support --kernel 5.15.0-1042-nvidia --without-fw-update --force --without-ucx-cuda
    
    # Validate the version of ucx installed
    root@dgx-os-6:/# apt list ucx
    
    # Download the latest ucx-cuda version matching the version of ucx, in this case 1.16.
    # The latest 1.16 version removed the dependency encountered in the previous step.
    root@dgx-os-6:/# wget https://github.com/openucx/ucx/releases/download/v1.16.0/ucx-1.16.0-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
    
    # Extract the files
    root@dgx-os-6:/# tar -xvjf ucx-1.16.0-ubuntu22.04-mofed5-cuda12-x86_64.tar.bz2
    
    # Install the ucx-cuda package
    root@dgx-os-6:/# dpkg -i ucx-cuda-1.16.0.deb
    
    # Validate the version of the ucx-cuda installed
    root@dgx-os-6:/# apt list ucx-cuda
    Listing… Done
    ucx-cuda/now 1.16.e4bb802 amd64 [installed,local]
    
  9. Validate the version on the image.

    root@dgx-os-6:/# ofed_info -s
    MLNX_OFED_LINUX-23.10-0.5.5.0:
    
  10. Reinstall additional packages for H100/H200 systems, then exit chroot.

    # H100 Based Systems
    root@dgx-os-6:/# apt install -y dgx-h100-system-configurations kdump-tools linux-crashdump nvidia-crashdump nvsm
    
    # H200 Based Systems
    root@dgx-os-6:/# apt install -y dgx-h200-system-configurations kdump-tools linux-crashdump nvidia-crashdump nvsm
    
    root@dgx-os-6:/# exit
    
  11. Create the ramdisk of the new image.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% softwareimage
    [demeter-headnode-01->softwareimage]% use dgx-os-6.1-h100-image
    [demeter-headnode-01->softwareimage[dgx-os-6.1-h100-image]]% createramdisk
    
  12. Validate that the new image is applied to the expected category of DGX nodes.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% category
    [demeter-headnode-01->category]% list
    
  13. Reboot the node category to apply the image system-wide, or reboot each node individually.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% device
    [demeter-headnode-01->device]% reboot -c dgx-h100
    
  14. Verify the MOFED version is updated on the DGX nodes after the reboots are completed.

    pdsh -w dgx-[01-31] ofed_info -s | sort
    

DOCA OFED Transition#

Since MOFED has now migrated to DOCA, NVIDIA recommends that DGX BasePOD and SuperPOD customers update to the latest DOCA drivers using BCM software and the following steps.

Update steps via BCM for DGX OS 6:

  1. Switch to the target image.

    cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image-DOCA/
    
  2. Remove the existing MOFED or old DOCA versions by following the instructions in the DOCA Installation and Upgrade Guide.

    # Uninstall DOCA packages
    for f in $( dpkg --list | grep -E 'doca|flexio|dpa-gdbserver|dpa-stats|dpaeumgmt' | awk '{print $2}' ); do echo $f ; sudo apt remove --purge $f -y ; done
    
    # Uninstall OFED packages
    /usr/sbin/ofed_uninstall.sh --force
    
    apt autoremove
    
  3. Lower the CUDA repository pin priority from its default of 580 to 480, because the newly added DOCA repository will have the default priority of 500. Otherwise, the MFT package would be installed at an earlier version that is incompatible with the DOCA driver. The priority will be restored to the default after the installation procedure is completed.

    sed -i 's/^Pin-Priority: 580$/Pin-Priority: 480/' /etc/apt/preferences.d/cuda-compute-repo
    
  4. Go to the DOCA downloads page (https://developer.nvidia.com/doca-2-9-3-download-archive) and select the appropriate OS and package. These instructions assume a local installer. The system here is based on DGX OS 6, which is Ubuntu 22.04 for x86_64.

  5. After you select the appropriate system and the deb (local) option, the installer instructions are displayed. The local mode is preferred because it contains all the packages necessary to manage the ConnectX-based cards. Run the displayed commands inside the BCM image chroot (a representative sketch follows).

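    # Representative sketch only (DOCA 2.9.3 local install); run the exact commands and use the exact
    # package file name shown on the DOCA downloads page for your selection.
    dpkg -i <doca-host-local-repo-package>.deb
    apt update
    apt install -y doca-ofed
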
  6. Once the installation is completed, validate that the DOCA version and the driver versions are installed.

    ofed_info -s
    apt list --installed | grep -i doca
    apt list --installed | grep -i ofed
    
  7. Change the CUDA repo preferences back to the default.

    sed -i 's/^Pin-Priority: 480$/Pin-Priority: 580/' /etc/apt/preferences.d/cuda-compute-repo
    
  8. Perform steps 10-14 in the MOFED (Mellanox OFED) Not Managed by BCM section to apply and validate the new drivers.