DGX OS

It is recommended that DGX A100/H100/H200 BasePOD and SuperPOD customers running BCM 10.x stay on DGX OS 6 (Ubuntu 22.04) until the upgrade to BCM 11 with DGX OS 7 (Ubuntu 24.04) is supported.

Update DGX OS

The update covers the DGX OS, DGX kernel, GPU drivers, CUDA Toolkit, and DCGM. This example updates DGX OS from version 6.1.0 to 6.3.2 and brings the associated software packages to their latest supported releases.

The process outlined below should not affect the running system until each DGX node is rebooted and downloads the new image. The DGX node reboots can be scheduled in a controlled manner. References for the upgrade process can be found in the DGX OS User Guide.
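
One way to keep the reboots controlled is to stage them in batches rather than rebooting every node at once. The following is a minimal sketch, not part of the documented procedure: the plan_reboot_batches function is hypothetical, the batch size of 10 is an arbitrary assumption, and the dgx-NN node names mirror the ones used in this example. It only prints the pdsh commands so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) pdsh reboot commands for DGX nodes in fixed-size batches.
# Assumptions: nodes are named dgx-NN with two-digit numbers; the batch size is arbitrary.

# plan_reboot_batches FIRST LAST SIZE
# Prints one "pdsh -w dgx-[AA-BB] reboot" line per batch of at most SIZE nodes.
plan_reboot_batches() {
    local first=$1 last=$2 size=$3
    local start end
    for (( start = first; start <= last; start += size )); do
        end=$(( start + size - 1 ))
        # Clamp the final batch to the last node number.
        if (( end > last )); then
            end=$last
        fi
        printf 'pdsh -w dgx-[%02d-%02d] reboot\n' "$start" "$end"
    done
}

# Nodes dgx-02 through dgx-31, ten nodes per batch.
plan_reboot_batches 2 31 10
```

With the arguments shown, it prints three commands covering dgx-[02-11], dgx-[12-21], and dgx-[22-31]. Pass the node numbers without leading zeros, because bash arithmetic rejects values such as 08 as invalid octal.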

  1. Prepare a copy of the DGX OS image. This example uses the dgx-os-6.1-h100-image-mofed software image as the base DGX image. Clone the DGX OS image to an image called dgx-os-6.3.2-h100-image.

    root@demeter-headnode-01:~# cmsh
    
    [demeter-headnode-01]% softwareimage
    [demeter-headnode-01->softwareimage]% clone dgx-os-6.1-h100-image-mofed dgx-os-6.3.2-h100-image
    [demeter-headnode-01->softwareimage*[dgx-os-6.3.2-h100-image*]]% commit
    [notice] demeter-headnode-01:Provisioning completed: sent demeter-headnode-01:/cm/images/dgx-os-6.3.2-h100-image to Demeter-headnode-02:/cm/images/dgx-os-6.3.2-h100-image, mode UPDATE, dry run = no
    
  2. Chroot to the cloned image to update the DGX OS.

    root@demeter-headnode-01:~# cm-chroot-sw-img /cm/images/dgx-os-6.3.2-h100-image/
    
  3. Verify the current DGX OS release and kernel version.

    root@dgx-os-6:/# cat /etc/dgx-release
    DGX_NAME="DGX Server"
    DGX_PRETTY_NAME="NVIDIA DGX Server"
    DGX_SWBUILD_DATE="2023-08-09-12-30-10"
    DGX_SWBUILD_VERSION="6.1.0"
    DGX_COMMIT_ID="87d8b12"
    DGX_PLATFORM="DGX Server for H100"
    DGX_SERIAL_NUMBER="Not Specified"
    root@dgx-os-6:/# uname -r
    5.19.0-45-generic
    
  4. Run apt update to refresh the local package repository metadata, then run apt list --upgradeable to list the packages that have newer versions available.

    root@dgx-os-6:/# apt update
    root@dgx-os-6:/# apt list --upgradeable
    
  5. Use the apt-mark hold command to lock any packages that should not be automatically updated, such as the slurm client.

    root@dgx-os-6:/# apt-mark hold slurm-*
    
  6. Use apt-mark unhold to release any packages that should be updated, such as the Linux kernel and headers. If the MLNX OFED packages were deployed from the BCM repository, these kernel packages will be on hold, which blocks kernel updates.

    root@dgx-os-6:/# apt-mark unhold linux-*
    
  7. Run apt upgrade to update the DGX OS image.

    root@dgx-os-6:/# apt upgrade
    
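  8. Once apt upgrade is completed, verify the updated DGX OS release. DGX_SWBUILD_VERSION still shows the version the image was originally built with (6.1.0), while DGX_OTA_VERSION shows the updated release (6.3.2).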

    root@dgx-os-6:/# cat /etc/dgx-release
    DGX_NAME="DGX Server"
    DGX_PRETTY_NAME="NVIDIA DGX Server"
    DGX_SWBUILD_DATE="2023-08-09-12-30-10"
    DGX_SWBUILD_VERSION="6.1.0"
    DGX_COMMIT_ID="87d8b12"
    DGX_PLATFORM="DGX Server for H100"
    DGX_SERIAL_NUMBER="FMGY9R3"
    DGX_OTA_VERSION="6.3.2"
    DGX_OTA_DATE="Wed May 28 12:16:11 AM PDT 2025"
    
  9. If updating from a DGX OS release earlier than 6.3.2, you must manually upgrade the datacenter-gpu-manager package from version 3.x to version 4.x. For more information, refer to the instructions in the installation section of the DCGM documentation. First, verify which datacenter-gpu-manager package(s) have been installed.

    root@dgx-os-6:/# apt list --installed | grep datacenter-gpu-manager
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    datacenter-gpu-manager/unknown,now 1:3.1.8 amd64 [installed,upgradable to: 1:3.3.9]
    
  10. Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.

    root@dgx-os-6:/# dpkg --list datacenter-gpu-manager &> /dev/null && apt purge --yes datacenter-gpu-manager
    root@dgx-os-6:/# dpkg --list datacenter-gpu-manager-config &> /dev/null && apt purge --yes datacenter-gpu-manager-config
    
  11. Run apt update to update the local package repository metadata.

    root@dgx-os-6:/# apt update
    
  12. Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version. You can verify the CUDA version installed in the cloned image by issuing the following command (in this case the CUDA version is 12).

    root@dgx-os-6:/# ls /usr/local/ | grep cuda
    cuda
    cuda-12
    cuda-12.2
    root@dgx-os-6:/# apt install --yes --install-recommends datacenter-gpu-manager-4-cuda12
    
  13. Verify the datacenter-gpu-manager packages are installed.

    root@dgx-os-6:/# apt list --installed | grep datacenter-gpu-manager
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    datacenter-gpu-manager-4-core/unknown,now 1:4.2.3 amd64 [installed,automatic]
    datacenter-gpu-manager-4-cuda12/unknown,now 1:4.2.3 amd64 [installed]
    datacenter-gpu-manager-4-proprietary-cuda12/unknown,now 1:4.2.3 amd64 [installed,automatic]
    datacenter-gpu-manager-4-proprietary/unknown,now 1:4.2.3 amd64 [installed,automatic]
    
  14. Verify the datacenter-gpu-manager version.

    root@dgx-os-6:/# dcgmi -v
    Version : 4.2.3
    Build ID : 11963
    Build Date : 2025-05-01
    Build Type : RelWithDebInfo
    Commit ID : 3effb0b0e49fdcf0b5c742f5ac18da32bb80636b
    Branch Name : v4.2.3
    CPU Arch : x86_64
    Build Platform : Linux 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64
    CRC : 7b156bd078b95fc6ef05ba9e9272173c
    
  15. Verify the Enroot version.

    root@dgx-os-6:/# enroot version
    3.5.0
    
  16. Verify the GPU driver branch and version. This example updates to BCM release 10.25.03, for which the validated GPU driver is in the 550 branch, so the GPU driver branch must change from 535 to 550.

    root@dgx-os-6:/# apt list --installed nvidia-driver*server
    Listing... Done
    nvidia-driver-535-server/jammy-updates,jammy-security,now 535.247.01-0ubuntu0.22.04.1 amd64 [installed]
    
  17. Show all available GPU driver branches.

    root@dgx-os-6:/# apt list nvidia-driver*server
    Listing... Done
    nvidia-driver-418-server/jammy-updates,jammy-security 418.226.00-0ubuntu5~0.22.04.1 amd64
    nvidia-driver-440-server/jammy-updates,jammy-security 450.248.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-450-server/jammy-updates,jammy-security 450.248.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-460-server/jammy-updates,jammy-security 470.256.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-470-server/jammy-updates,jammy-security 470.256.02-0ubuntu0.22.04.1 amd64
    nvidia-driver-510-server/jammy-updates,jammy-security 515.105.01-0ubuntu0.22.04.1 amd64
    nvidia-driver-515-server/jammy-updates,jammy-security 525.147.05-0ubuntu2.22.04.1 amd64
    nvidia-driver-525-server/jammy-updates,jammy-security 525.147.05-0ubuntu2.22.04.1 amd64
    nvidia-driver-535-server/jammy-updates,jammy-security,now 535.247.01-0ubuntu0.22.04.1 amd64 [installed]
    nvidia-driver-550-server/jammy-updates,jammy-security 550.163.01-0ubuntu0.22.04.1 amd64
    nvidia-driver-565-server/jammy-updates 565.57.01-0ubuntu0.22.04.4 amd64
    nvidia-driver-570-server/jammy-updates,jammy-security 570.133.20-0ubuntu0.22.04.1 amd64
    
  18. Preview the packages to be installed by running the install command with the --dry-run option, then run it again without --dry-run to upgrade the NVIDIA GPU driver.

    root@dgx-os-6:/# apt install -y nvidia-driver-550-server linux-modules-nvidia-550-server-nvidia libnvidia-nscq-550 nvidia-modprobe nvidia-fabricmanager-550 nv-persistence-mode --dry-run
    
    root@dgx-os-6:/# apt install -y nvidia-driver-550-server linux-modules-nvidia-550-server-nvidia libnvidia-nscq-550 nvidia-modprobe nvidia-fabricmanager-550 nv-persistence-mode
    
  19. Verify the NVIDIA GPU driver branch and version installed.

    root@dgx-os-6:/# apt list --installed | grep nvidia-driver
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    nvidia-driver-550-server/jammy-updates,jammy-security,now 550.163.01-0ubuntu0.22.04.1 amd64 [installed]
    
  20. Install or update the CUDA Toolkit to the version validated for the BCM release.

    root@dgx-os-6:/# apt update
    root@dgx-os-6:/# apt list cuda-toolkit-*
    root@dgx-os-6:/# apt install cuda-toolkit-12-4
    
  21. Verify the installed cuda-toolkit version.

    root@dgx-os-6:/# apt list --installed | grep cuda-toolkit
    
    WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
    
    cuda-toolkit-12-4-config-common/unknown,now 12.4.127-1 all [installed,automatic]
    cuda-toolkit-12-4/unknown,now 12.4.1-1 amd64 [installed]
    cuda-toolkit-12-config-common/unknown,now 12.9.37-1 all [installed,automatic]
    cuda-toolkit-config-common/unknown,now 12.9.37-1 all [installed,automatic]
    
  22. Install the additional packages for H100 or H200 systems, then exit the chroot. Run only the install command that matches your platform.

    # H100 Based Systems
    root@dgx-os-6:/# apt install -y dgx-h100-system-configurations kdump-tools linux-crashdump nvidia-crashdump nvsm
    
    # H200 Based Systems
    root@dgx-os-6:/# apt install -y dgx-h200-system-configurations kdump-tools linux-crashdump nvidia-crashdump nvsm
    
    root@dgx-os-6:/# exit
    
  23. Assign the new kernel version to the updated image. In cmsh, select the updated DGX kernel for the dgx-os-6.3.2-h100-image software image.

    Note

    Type set kernelversion and then press the Tab key twice to list the available versions through tab completion, then select the updated version.

    Wait until you see the “Initial ramdisk for image dgx-os-6.3.2-h100-image was generated successfully” message before exiting cmsh.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% softwareimage
    [demeter-headnode-01->softwareimage]% use dgx-os-6.3.2-h100-image
    [demeter-headnode-01->softwareimage[dgx-os-6.3.2-h100-image]]% set kernelversion 5.15.0-1078-nvidia
    [demeter-headnode-01->softwareimage*[dgx-os-6.3.2-h100-image*]]% commit
    Sun Jun  1 22:15:28 2025 [notice] demeter-headnode-01: Initial ramdisk for image dgx-os-6.3.2-h100-image is being generated
    Sun Jun  1 22:16:15 2025 [notice] demeter-headnode-01: Initial ramdisk for image dgx-os-6.3.2-h100-image was generated successfully
    
  24. Apply the updated DGX OS image to the DGX category in cmsh.

    root@demeter-headnode-01:~# cmsh
    [demeter-headnode-01]% category
    [demeter-headnode-01->category]% use dgx-h100
    [demeter-headnode-01->category[dgx-h100]]% set softwareimage dgx-os-6.3.2-h100-image
    [demeter-headnode-01->category*[dgx-h100*]]% commit
    [demeter-headnode-01->category[dgx-h100]]%
    
  25. Reboot one of the DGX nodes to confirm that the DGX OS was updated properly.

    root@demeter-headnode-01:~# pdsh -w dgx-01 reboot
    dgx-01: Connection to dgx-01 closed by remote host.
    
  26. Once you have verified that the DGX node booted with the updated DGX OS, reboot the remaining DGX nodes.

    root@demeter-headnode-01:~# pdsh -w dgx-[02-31] reboot
    
  27. If needed, update the drivers as well by following the sections found later in this document.
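
Steps 3 and 8 check /etc/dgx-release by eye. As a rough sketch, the same check could be scripted. The function names dgx_release_field and dgx_effective_version are hypothetical, and the sketch assumes that DGX_OTA_VERSION, when present, reflects the most recent update (taking the last such line if there are several) and that DGX_SWBUILD_VERSION is the version the image was originally built with.

```shell
#!/usr/bin/env bash
# Sketch: report the effective DGX OS version recorded in a dgx-release file.
# Assumption: the file contains KEY="value" lines, as in the output of steps 3 and 8.

# dgx_release_field KEY FILE
# Prints the value(s) of KEY from FILE with the surrounding double quotes removed.
dgx_release_field() {
    local key=$1 file=$2
    sed -n "s/^${key}=\"\(.*\)\"\$/\1/p" "$file"
}

# dgx_effective_version [FILE]
# Prints the last DGX_OTA_VERSION if one exists, otherwise DGX_SWBUILD_VERSION.
dgx_effective_version() {
    local file=${1:-/etc/dgx-release}
    local ota
    ota=$(dgx_release_field DGX_OTA_VERSION "$file" | tail -n 1)
    if [ -n "$ota" ]; then
        printf '%s\n' "$ota"
    else
        dgx_release_field DGX_SWBUILD_VERSION "$file"
    fi
}
```

For example, running dgx_effective_version /cm/images/dgx-os-6.3.2-h100-image/etc/dgx-release on the head node would be expected to print 6.3.2 for the image updated in this example. Called with no argument inside the chroot, it reads /etc/dgx-release.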