Installing or Upgrading BaseOS Components#
This section provides details for installing additional software or upgrading components of BaseOS, such as a newer GPU driver branch.
Important
When you install new software or upgrade an existing component to a newer branch, update the system image so that the latest version of that software component is installed.
Refer to Updating the BaseOS Image for instructions on updating the system image with the latest version of the installed software.
The instructions in this section assume that you have entered the context of the BaseOS image using the cm-chroot-sw-img command:
$ cm-chroot-sw-img baseos-image
Changing the GPU Driver Branch#
The following instructions describe the steps for upgrading or installing a different GPU driver branch.
Check and verify the currently installed GPU driver branch:
# apt list --installed nvidia-driver*server
Listing... Done
nvidia-driver-535-server/jammy-updates,jammy-security,now 535.247.01-0ubuntu0.22.04.1 amd64 [installed]
Show all available GPU driver branches.
# apt list nvidia-driver*server
Listing... Done
nvidia-driver-535-server/jammy-updates,jammy-security,now 535.247.01-0ubuntu0.22.04.1 amd64 [installed]
nvidia-driver-570-server/jammy-updates,jammy-security 570.133.20-0ubuntu0.22.04.1 amd64
Install the packages for the selected driver branch. Use the option --dry-run to validate the … without installing it:
# apt install -y nvidia-driver-550-server linux-modules-nvidia-550-server-nvidia libnvidia-nscq-550 nvidia-modprobe nvidia-fabricmanager-550 nv-persistence-mode --dry-run
Now, install the driver (without --dry-run):
# apt install -y nvidia-driver-550-server linux-modules-nvidia-550-server-nvidia libnvidia-nscq-550 nvidia-modprobe nvidia-fabricmanager-550 nv-persistence-mode
Verify that the new NVIDIA GPU driver branch and version are installed.
# apt list --installed | grep nvidia-driver
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
nvidia-driver-550-server/jammy-updates,jammy-security,now 550.163.01-0ubuntu0.22.04.1 amd64 [installed]
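The package names in the install step differ between branches only in the branch number, so the list can be generated rather than typed. The following is a minimal sketch, assuming that other branches follow the same naming pattern as the 550 example; `driver_packages` and `branch_of` are hypothetical helper names, not part of BaseOS.

```shell
#!/bin/sh
# Hypothetical helper: print the driver packages for a branch, one per line.
# ASSUMPTION: every branch uses the same package naming as the 550 example.
driver_packages() {
    branch="$1"
    printf '%s\n' \
        "nvidia-driver-${branch}-server" \
        "linux-modules-nvidia-${branch}-server-nvidia" \
        "libnvidia-nscq-${branch}" \
        "nvidia-modprobe" \
        "nvidia-fabricmanager-${branch}" \
        "nv-persistence-mode"
}

# Hypothetical helper: read `apt list` style lines on stdin and print the
# branch number of each nvidia-driver-<branch>-server entry.
branch_of() {
    sed -n 's/^nvidia-driver-\([0-9][0-9]*\)-server.*/\1/p'
}

# Possible use inside the image (left commented out because it needs root):
#   apt install -y $(driver_packages 550) --dry-run
#   apt list --installed 2>/dev/null | branch_of
```

Generating the list from one argument keeps the dry run and the real install in step, since both commands expand the same set of names.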
Changing the CUDA Toolkit Version#
This section describes the steps for installing or upgrading the system to a new CUDA Toolkit release.
Install or update the CUDA Toolkit to the version that is validated in the BCM:
List all available CUDA versions.
# apt list cuda-toolkit-*
Install the ..
# apt install cuda-toolkit-12-4
Verify the installed cuda-toolkit version
# apt list --installed | grep cuda-toolkit
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
cuda-toolkit-12-4-config-common/unknown,now 12.4.127-1 all [installed,automatic]
cuda-toolkit-12-4/unknown,now 12.4.1-1 amd64 [installed]
cuda-toolkit-12-config-common/unknown,now 12.9.37-1 all [installed,automatic]
cuda-toolkit-config-common/unknown,now 12.9.37-1 all [installed,automatic]
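The verification output mixes the versioned toolkit package with its config-common companions, which makes it easy to misread. The sketch below shows one way to map a CUDA version to its package name and to filter the listing down to the versioned toolkit packages. It assumes the `cuda-toolkit-<major>-<minor>` naming seen in the example; `cuda_toolkit_pkg` and `installed_toolkits` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: map a CUDA version such as 12.4 to the package name
# cuda-toolkit-12-4. ASSUMPTION: the naming matches the example in this section.
cuda_toolkit_pkg() {
    printf 'cuda-toolkit-%s\n' "$(printf '%s' "$1" | tr '.' '-')"
}

# Hypothetical helper: read `apt list --installed` style lines on stdin and
# print only versioned toolkit packages, skipping the *-config-common entries.
installed_toolkits() {
    sed -n 's|^\(cuda-toolkit-[0-9][0-9]*-[0-9][0-9]*\)/.*\[installed.*|\1|p'
}

# Possible use inside the image (left commented out because it needs apt):
#   apt install "$(cuda_toolkit_pkg 12.4)"
#   apt list --installed 2>/dev/null | installed_toolkits
```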
TODO: NCCL?? GDS??
Upgrading the Data Center GPU Manager (DCGM)#
To change the DCGM version installed in the BaseOS image, follow these steps:
Validate the version that has been installed in the system image:
# apt list --installed | grep datacenter-gpu-manager
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
datacenter-gpu-manager/unknown,now 1:3.1.8 amd64 [installed,upgradable to: 1:3.3.9]
Remove any installations of the datacenter-gpu-manager and datacenter-gpu-manager-config packages.
# dpkg --list datacenter-gpu-manager &> /dev/null && apt purge --yes datacenter-gpu-manager
# dpkg --list datacenter-gpu-manager-config &> /dev/null && apt purge --yes datacenter-gpu-manager-config
Install the datacenter-gpu-manager-4 package corresponding to the system CUDA version. You can verify the CUDA version installed in the cloned image by issuing the following command; in this case, the CUDA version is 12.
# ls /usr/local/ | grep cuda
cuda
cuda-12
cuda-12.2
# apt install -y --install-recommends datacenter-gpu-manager-4-cuda12
Verify the datacenter-gpu-manager packages are installed.
# apt list --installed | grep datacenter-gpu-manager
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
datacenter-gpu-manager-4-core/unknown,now 1:4.2.3 amd64 [installed,automatic]
datacenter-gpu-manager-4-cuda12/unknown,now 1:4.2.3 amd64 [installed]
datacenter-gpu-manager-4-proprietary-cuda12/unknown,now 1:4.2.3 amd64 [installed,automatic]
datacenter-gpu-manager-4-proprietary/unknown,now 1:4.2.3 amd64 [installed,automatic]
Verify the datacenter-gpu-manager version.
# dcgmi -v
Version : 4.2.3
Build ID : 11963
Build Date : 2025-05-01
Build Type : RelWithDebInfo
Commit ID : 3effb0b0e49fdcf0b5c742f5ac18da32bb80636b
Branch Name : v4.2.3
CPU Arch : x86_64
Build Platform : Linux 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64
CRC : 7b156bd078b95fc6ef05ba9e9272173c
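Choosing the DCGM package depends on the CUDA major version found under /usr/local, so that lookup can be scripted. The following is a rough sketch, assuming the directory layout shown above (`cuda`, `cuda-12`, `cuda-12.2`) and the `datacenter-gpu-manager-4-cuda<major>` naming from the install step; `cuda_major` and `dcgm_pkg` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: read directory names such as cuda, cuda-12, cuda-12.2
# on stdin and print the highest CUDA major version among them.
cuda_major() {
    sed -n 's/^cuda-\([0-9][0-9]*\).*/\1/p' | sort -n | tail -n 1
}

# Hypothetical helper: build the DCGM 4 package name for a CUDA major version.
# ASSUMPTION: the suffix follows the cuda12 example from the install step.
dcgm_pkg() {
    printf 'datacenter-gpu-manager-4-cuda%s\n' "$1"
}

# Possible use inside the image (left commented out because it needs root):
#   apt install -y --install-recommends "$(dcgm_pkg "$(ls /usr/local/ | cuda_major)")"
```

If several CUDA major versions are installed side by side, this sketch picks the highest one, which may not be the one you intend to target.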
Upgrading DOCA OFED#
TODO
Installing Additional Software#
You can install additional packages into the BaseOS image, including Ubuntu packages of NVIDIA software tools. For example:
# apt install -y kdump-tools linux-crashdump nvidia-crashdump
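After installing extra packages, it can be useful to confirm that none were skipped. The sketch below compares a list of wanted packages against a list of installed package names supplied on standard input. `missing_packages` is a hypothetical helper name, and the suggested `dpkg-query` pipeline in the comment is an assumption about how you might feed it.

```shell
#!/bin/sh
# Hypothetical helper: read installed package names on stdin (one per line)
# and print each package given as an argument that is not in that list.
missing_packages() {
    installed="$(cat)"
    for pkg in "$@"; do
        # -x matches the whole line, -F treats the name as a fixed string.
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || printf '%s\n' "$pkg"
    done
}

# Possible use inside the image (commented out; requires dpkg):
#   dpkg-query -W -f='${Package}\n' | missing_packages kdump-tools linux-crashdump nvidia-crashdump
# Note: dpkg-query -W can also list packages that were removed but not purged,
# so treat an empty result as a hint rather than proof.
```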