Deploying#

This section describes how to install the software components for your MNNVL system.

Operating System Requirements#

The MNNVL compute trays use NVIDIA Grace CPUs, which are Arm-based.

NVIDIA provides guides for installing Linux distributions on Grace CPU platforms at the following links:

Important

Before installing NVIDIA software, follow the directions from your manufacturer to update all system firmware.

Installing NVIDIA Software#

The following steps describe how to install the NVIDIA software packages on Ubuntu 24.04.

Remove Existing Packages#

Remove any existing NVIDIA packages:

$ sudo apt purge 'nvidia-*'
$ sudo apt purge 'cuda*'
$ sudo apt purge libxnvctrl0
$ sudo apt purge 'glx-*'

Note

These packages are not present on a fresh Ubuntu 24.04 installation. In that case, apt purge reports that it cannot find the packages, which is expected.

Clean the Local APT Cache#

To clean and reset the existing APT cache, run the following commands:

$ sudo apt autoremove
$ sudo apt autoclean

Unload the Kernel Modules#

To unload the kernel modules, run the following commands:

$ sudo modprobe -r mods
$ sudo modprobe -r nvidia_uvm
$ sudo modprobe -r nvidia_drm
$ sudo modprobe -r nvidia_modeset
$ sudo modprobe -r nvidia_vgpu_vfio
$ sudo modprobe -r nvidia_peermem
$ sudo modprobe -r nvidia
$ sudo modprobe -r nvidiafb
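Not every module in the list above is loaded on every system. The following is a minimal sketch, not part of the official procedure, that reports which of the listed modules are currently loaded so that only those need to be removed. The function name loaded_nvidia_modules and the variable NVIDIA_MODULES are illustrative names introduced here.

```shell
# Modules in the unload order used above.
NVIDIA_MODULES="mods nvidia_uvm nvidia_drm nvidia_modeset nvidia_vgpu_vfio nvidia_peermem nvidia nvidiafb"

# Print, in unload order, the listed modules that appear in a
# /proc/modules-style file (defaults to /proc/modules).
loaded_nvidia_modules() {
    modules_file="${1:-/proc/modules}"
    for mod in $NVIDIA_MODULES; do
        # The first field of each /proc/modules line is the module name.
        if awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$modules_file"; then
            echo "$mod"
        fi
    done
}
```

For example, `for m in $(loaded_nvidia_modules); do sudo modprobe -r "$m"; done` would unload only the modules that are actually present.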

Install the NVIDIA Linux Kernel and Packages#

For optimal performance, the Grace-Blackwell systems require the NVIDIA-specific Linux kernel.

Note

If you use a custom kernel, refer to the NVIDIA Grace Platform Support Software Patches and Configurations for a list of kernel customizations required for the Grace CPU.

Remove the existing Linux image, headers, and modules:

sudo apt remove linux-image-$(uname -r) linux-headers-$(uname -r) linux-modules-$(uname -r)

Install the NVIDIA Linux kernel:

sudo apt install linux-nvidia-64k-hwe-24.04

Reboot the system to load the new NVIDIA kernel:

sudo reboot
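After the reboot, it can help to confirm that the system came up on the new kernel. The following is a minimal sketch under one stated assumption: the kernel release string of the NVIDIA 64k kernel contains the text nvidia-64k. That naming pattern is inferred from the package name and is not confirmed by this document. The function names are illustrative.

```shell
# Succeeds if the given kernel release string looks like an NVIDIA 64k kernel.
# Assumption: such release strings contain "nvidia-64k".
is_nvidia_64k_kernel() {
    case "$1" in
        *nvidia-64k*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the running kernel release and the memory page size.
report_kernel() {
    release="$(uname -r)"
    if is_nvidia_64k_kernel "$release"; then
        echo "kernel $release looks like an NVIDIA 64k kernel"
    else
        echo "kernel $release does not look like an NVIDIA 64k kernel"
    fi
    echo "page size: $(getconf PAGESIZE) bytes"
}
```

Running `report_kernel` on the compute tray prints both values. A 64k kernel is expected to report a page size of 65536 bytes.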

Install the Kernel Build Packages#

Install packages required to install the NVIDIA drivers:

sudo apt update -y && sudo apt install -y gcc dkms make

Install IMEX and the NVIDIA GPU Driver#

Note

Contact NVIDIA for access to the Blackwell GPU driver and IMEX packages.

Installing the driver and IMEX packages with APT requires installing a local APT repository and configuring APT to prefer the local repository for those packages.

Important

The following are example filenames. Check your specific release for the correct filenames and versions.

Here are steps to install the driver with APT:

  1. Copy the nvidia-driver-local-repo package to the compute tray and install it with sudo apt install.

sudo apt install ./nvidia-driver-local-repo-ubuntu2404-570.124.06_1.0-1_arm64.deb

  2. Install a copy of the local repository GPG key in the Ubuntu keyring.

The output of the install step provides the correct GPG key filename.

For example:

Selecting previously unselected package nvidia-driver-local-repo-ubuntu2404-570.124.06.
(Reading database ... 67260 files and directories currently installed.)
Preparing to unpack nvidia-driver-local-repo-ubuntu2404-570.124.06_1.0-1_arm64.deb ...
Unpacking nvidia-driver-local-repo-ubuntu2404-570.124.06 (1.0-1) ...
Setting up nvidia-driver-local-repo-ubuntu2404-570.124.06 (1.0-1) ...

The public nvidia-driver-local-repo-ubuntu2404-570.124.06 GPG key does not appear to be installed.
To install the key, run this command:
sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.124.06/nvidia-driver-local-9A30370E-keyring.gpg /usr/share/keyrings/

  3. Create an APT pin for NVIDIA-specific packages to prevent changes in upstream packages from conflicting with the MNNVL-specific packages.

Create the file /etc/apt/preferences.d/00-nvidia-prefer with the following contents:

Package: *
Pin: origin ""
Pin-Priority: 1001

  4. Update the APT cache with apt update.

sudo apt update

  5. Install the CUDA Toolkit package:

sudo apt install cuda-toolkit

  6. Remove any existing GPU hardware settings by deleting the file /etc/modprobe.d/nvidia.conf.

sudo rm /etc/modprobe.d/nvidia.conf

  7. Install the NVIDIA driver and modprobe packages:

sudo apt install nvidia-dkms-570-open nvidia-driver-570-open nvidia-modprobe

  8. Install IMEX.

sudo apt-get install nvidia-imex
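Two of the steps above lend themselves to small helpers: picking the keyring path out of the install output, and generating the APT pin file. The following is a minimal sketch. The function names are illustrative, and the path pattern (a path under /var/ ending in -keyring.gpg) is an assumption based only on the example output shown above.

```shell
# Read the local-repo install output on stdin and print the keyring path.
# Assumption: the path starts with /var/ and ends with -keyring.gpg,
# as in the example output above.
keyring_path_from_output() {
    grep -o '/var/[^ ]*-keyring\.gpg' | head -n 1
}

# Print the APT pin shown above, so it can be piped to the preferences file.
nvidia_pin_contents() {
    cat <<'EOF'
Package: *
Pin: origin ""
Pin-Priority: 1001
EOF
}
```

For example, `nvidia_pin_contents | sudo tee /etc/apt/preferences.d/00-nvidia-prefer` writes the pin file. After `sudo apt update`, running `apt-cache policy nvidia-imex` shows which repository the package would be installed from.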

Install DOCA Host#

The DOCA Host package provides the OS drivers for the BlueField-3 and ConnectX adapters.

Remove older versions of DOCA and the adapter drivers:

Note

If no drivers were previously installed, the ofed_uninstall.sh script does not exist and the command reports that it cannot be found.

$ for f in $( dpkg --list | grep -E 'doca|flexio|dpa-gdbserver|dpa-stats|dpaeumgmt' | awk '{print $2}' ); do echo "$f" ; sudo apt remove --purge "$f" -y ; done
$ sudo /usr/sbin/ofed_uninstall.sh --force
$ sudo apt-get autoremove
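The removal loop above matches its pattern against the whole dpkg --list line, including the description column. As a hedged alternative, the following sketch matches only the package name column, which avoids selecting an unrelated package whose description happens to mention one of the keywords. The function name doca_package_names is illustrative.

```shell
# Read `dpkg --list` output on stdin and print the names of packages whose
# name (second column) matches the DOCA-related patterns used above.
doca_package_names() {
    awk '$2 ~ /doca|flexio|dpa-gdbserver|dpa-stats|dpaeumgmt/ { print $2 }'
}
```

Running `dpkg --list | doca_package_names` first, without removing anything, gives a list to review before purging.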

Install the DOCA host Debian package and update APT.

$ sudo dpkg -i doca-host-2.10.0-093509-25.01-ubuntu2404
$ sudo apt-get update

Install the doca-extra package to enable NVIDIA kernel support.

sudo apt install -y doca-extra

Rebuild the kernel headers with doca-kernel-support.

sudo /opt/mellanox/doca/tools/doca-kernel-support

The doca-kernel-support command generates a new Debian package file. Install this file with dpkg -i <package_file>.

For example:

$ sudo /opt/mellanox/doca/tools/doca-kernel-support
doca-kernel-support: Built single package: /tmp/DOCA.EuUfkWfV7Z/doca-kernel-repo-2.9.0-1.kver.5.14.0.356.el9.arm.deb
doca-kernel-support: Done
$ sudo dpkg -i /tmp/DOCA.EuUfkWfV7Z/doca-kernel-repo-2.9.0-1.kver.5.14.0.356.el9.arm.deb
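Because the generated package lands in a randomly named temporary directory, copying the path by hand is error-prone. The following is a minimal sketch that extracts the path from the command output. It assumes the output keeps the "Built single package:" line format shown in the example above. The function name built_package_path is illustrative.

```shell
# Read doca-kernel-support output on stdin and print the built package path.
# Assumption: the output contains a line of the form
#   doca-kernel-support: Built single package: <path>
built_package_path() {
    sed -n 's/^doca-kernel-support: Built single package: //p'
}
```

One possible use is `pkg=$(sudo /opt/mellanox/doca/tools/doca-kernel-support | built_package_path)` followed by `sudo dpkg -i "$pkg"`, after checking that the variable is not empty.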

Update APT and install the doca-all host profile.

sudo apt update
sudo apt -y install doca-all

Note

doca-kernel-support doesn’t support customized or unofficial kernels. Refer to Installing Software on Host for more information about the installation.

Configure the NVIDIA Software Packages#

This section details the configuration options for the GPU driver and IMEX application.

Enable Profiling#

Enable profiling for all users by applying the settings to /etc/modprobe.d/nvprofiling.conf.

echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvprofiling.conf

Enable the IMEX Control Channel#

Configure IMEX to automatically enable the IMEX control channel by configuring /etc/modprobe.d/nvidia.conf.

$ cat /etc/modprobe.d/nvidia.conf

options nvidia NVreg_CreateImexChannel0=1
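The file listing above shows the expected result but not how to produce it. One way, sketched here with an illustrative helper named ensure_line, is to append the option line only when it is not already present, so the step can be repeated safely.

```shell
# Append LINE to FILE unless an identical line already exists.
ensure_line() {
    file="$1"
    line="$2"
    touch "$file"
    # -x matches whole lines, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Run as root, `ensure_line /etc/modprobe.d/nvidia.conf 'options nvidia NVreg_CreateImexChannel0=1'` would leave the file with exactly one copy of the option.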

Note

Multitenant MNNVL systems require a more complex IMEX configuration that is beyond the scope of this document.

Refer to the IMEX documentation for more information.

Configure IMEX Peers#

Create the file /etc/nvidia-imex/nodes_config.cfg with the management Ethernet interface IP addresses of all nodes in the cluster.

Here is an example /etc/nvidia-imex/nodes_config.cfg file:

Note

Place one IP address per line, with no other information.

10.114.228.6
10.114.228.7
10.114.228.8
10.114.228.9
10.114.228.10
10.114.228.11
10.114.228.12
10.114.228.13
10.114.228.14

Note

IMEX can use any compute tray IP address, as long as there is connectivity between all IP addresses in the configuration.
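Since the file must contain one IP address per line with no other information, a quick format check before starting IMEX can catch stray hostnames or trailing text. The following is a minimal sketch that accepts only bare IPv4 addresses, matching the example above. Whether IMEX accepts other address forms is not covered here, so treat the IPv4-only rule as an assumption. The function name validate_nodes_config is illustrative.

```shell
# Check that every line of FILE is a bare dotted-quad IPv4 address.
# Prints each offending line and returns non-zero if any line fails.
validate_nodes_config() {
    awk '
        {
            ok = ($0 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/)
            if (ok) {
                n = split($0, octet, ".")
                # Each octet must be in the range 0..255.
                for (i = 1; i <= n; i++) if (octet[i] + 0 > 255) ok = 0
            }
            if (!ok) {
                printf "line %d: not a bare IPv4 address: %s\n", NR, $0
                bad = 1
            }
        }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, `validate_nodes_config /etc/nvidia-imex/nodes_config.cfg` prints nothing and succeeds for the example file above.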

Enable the NVIDIA Persistence Daemon#

Configure the NVIDIA persistence daemon by editing the /etc/systemd/system/nvidia-persistenced.service file.

$ cat /etc/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

Enable the daemon to start at boot:

sudo systemctl enable nvidia-persistenced.service

Enable the IMEX Daemon#

Enable the IMEX service to start at boot:

sudo systemctl enable nvidia-imex.service

Rebuild the initramfs#

After changing the IMEX configuration, rebuild the initramfs.

sudo update-initramfs -u -k all

Reboot the Compute Tray#

Reboot the compute tray to complete the driver and application configurations.

sudo reboot
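Once the compute tray is back up, it can be useful to confirm that the module options set earlier took effect. The following is a minimal sketch under stated assumptions: the driver exposes its registry keys in /proc/driver/nvidia/params as "Key: value" lines, with names that omit the NVreg_ prefix. Verify both assumptions against your driver release. The function name nvreg_value is illustrative.

```shell
# Print the value of a driver registry key from a "Key: value" params file.
# Assumption: the default file /proc/driver/nvidia/params uses this format.
nvreg_value() {
    key="$1"
    params_file="${2:-/proc/driver/nvidia/params}"
    awk -F': ' -v k="$key" '$1 == k { print $2 }' "$params_file"
}
```

Under these assumptions, `nvreg_value CreateImexChannel0` would print 1 and `nvreg_value RestrictProfilingToAdminUsers` would print 0 after the configuration above. `systemctl is-active nvidia-imex.service nvidia-persistenced.service` reports whether both services started.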

Switch Tray Software#

Each NVLink switch tray runs a local instance of the NVOS software. The NVLink switches ship from the factory with NVOS preinstalled.

NVOS contains all the required software and firmware components.

Tip

Refer to your manufacturer for the latest version of NVOS and upgrade instructions.

To show the version information for the platform firmware components, run the following command:

$ nv show platform firmware

Here is an example output:

The actual firmware version on your system might be different.

Figure 2 nv show platform firmware output#