Deploying#
This section describes how to install the software components for your MNNVL system.
Operating System Requirements#
The MNNVL compute trays use NVIDIA Grace Arm-based CPUs.
NVIDIA provides guides for installing Linux distributions on Grace CPU platforms at the following links:
NVIDIA Grace Software with Red Hat Enterprise Linux 9 Installation Guide
NVIDIA Grace Software with SUSE Linux Enterprise Server 15 Installation Guide
Important
Before installing NVIDIA software, follow the directions from your manufacturer to update all system firmware.
Installing NVIDIA Software#
The following steps describe how to install the NVIDIA software packages on Ubuntu 24.04.
Remove Existing Packages#
Remove any existing NVIDIA packages:
$ sudo apt purge nvidia-*
$ sudo apt purge cuda*
$ sudo apt purge libxnvctrl0
$ sudo apt purge glx-*
Note
These packages are not installed on a new Ubuntu 24.04 install. In that case, apt purge will not find them.
Clean the Local APT Cache#
To clean and reset the existing APT cache, run the following commands.
$ sudo apt autoremove
$ sudo apt autoclean
Unload the Kernel Modules#
To unload the kernel modules, run the following commands.
$ sudo modprobe -r mods
$ sudo modprobe -r nvidia_uvm
$ sudo modprobe -r nvidia_drm
$ sudo modprobe -r nvidia_modeset
$ sudo modprobe -r nvidia_vgpu_vfio
$ sudo modprobe -r nvidia_peermem
$ sudo modprobe -r nvidia
$ sudo modprobe -r nvidiafb
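The per-module commands above can also be run as a loop. This is a hedged sketch (not part of the original procedure) that consults /proc/modules first, so modprobe -r is only attempted for modules that are actually loaded:

```shell
# Hedged sketch: attempt to unload each NVIDIA-related module in the same
# dependency order as above, skipping modules that are not loaded.
for m in mods nvidia_uvm nvidia_drm nvidia_modeset nvidia_vgpu_vfio \
         nvidia_peermem nvidia nvidiafb; do
    if grep -qw "^$m" /proc/modules 2>/dev/null; then
        sudo modprobe -r "$m"
    fi
done
echo "NVIDIA module unload pass complete"
```

Skipping unloaded modules avoids the "Module not found" errors that modprobe -r prints on a clean system.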
Install the NVIDIA Linux Kernel and Packages#
For optimal performance, the Grace-Blackwell systems require the NVIDIA-specific Linux kernel.
Note
If you use a custom kernel, refer to the NVIDIA Grace Platform Support Software Patches and Configurations for a list of kernel customizations required for the Grace CPU.
Remove the existing Linux image, headers, and modules:
sudo apt remove linux-image-$(uname -r) linux-headers-$(uname -r) linux-modules-$(uname -r)
Install the NVIDIA Linux kernel:
sudo apt install linux-nvidia-64k-hwe-24.04
Reboot the system to load the new NVIDIA kernel:
sudo reboot
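After the reboot, it is worth confirming that the new kernel is actually running. This hedged check assumes the installed kernel's release string contains the "nvidia-64k" flavor (derived from the package name above; verify against your release):

```shell
# Hedged check (flavor string assumed from the package name): the running
# kernel release should contain "nvidia-64k" after the reboot.
KREL=$(uname -r)
case "$KREL" in
    *nvidia-64k*) echo "NVIDIA kernel active: $KREL" ;;
    *)            echo "unexpected kernel: $KREL" ;;
esac
```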
Install the Kernel Build Packages#
Install packages required to install the NVIDIA drivers:
sudo apt update -y && sudo apt install -y gcc dkms make
Install IMEX and the NVIDIA GPU Driver#
Note
Contact NVIDIA for access to the Blackwell GPU driver and IMEX packages.
Installing the driver and IMEX packages with APT requires installing a local APT repository and configuring APT to prefer the local repository for those packages.
Important
The following are example filenames. Check your specific release for the correct filenames and versions.
Here are steps to install the driver with APT:
Copy the nvidia-driver-local-repo package to the compute tray and install it with sudo apt install.
sudo apt install ./nvidia-driver-local-repo-ubuntu2404-570.124.06_1.0-1_arm64.deb
Copy the local repository GPG key to the Ubuntu keyring.
The output of the install step provides the correct GPG key filename.
For example:
Selecting previously unselected package nvidia-driver-local-repo-ubuntu2404-570.124.06.
(Reading database ... 67260 files and directories currently installed.)
Preparing to unpack nvidia-driver-local-repo-ubuntu2404-570.124.06_1.0-1_arm64.deb ...
Unpacking nvidia-driver-local-repo-ubuntu2404-570.124.06 (1.0-1) ...
Setting up nvidia-driver-local-repo-ubuntu2404-570.124.06 (1.0-1) ...
The public nvidia-driver-local-repo-ubuntu2404-570.124.06 GPG key does not appear to be installed.
To install the key, run this command:
sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.124.06/nvidia-driver-local-9A30370E-keyring.gpg /usr/share/keyrings/
Create an APT pin for NVIDIA-specific packages to prevent changes in upstream packages from conflicting with the MNNVL-specific packages.
Create the file /etc/apt/preferences.d/00-nvidia-prefer with the following contents:
Package: *
Pin: origin ""
Pin-Priority: 1001
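One way to apply this safely is to stage the pin file in a scratch directory first and review it before copying it into place. A hedged sketch:

```shell
# Hedged sketch: stage the pin file in a temporary directory for review
# before copying it to /etc/apt/preferences.d/00-nvidia-prefer.
PIN_FILE="$(mktemp -d)/00-nvidia-prefer"
cat > "$PIN_FILE" <<'EOF'
Package: *
Pin: origin ""
Pin-Priority: 1001
EOF
cat "$PIN_FILE"
```

Then install it with sudo cp "$PIN_FILE" /etc/apt/preferences.d/. The empty origin ("") matches local repositories, and a priority above 1000 makes APT prefer the pinned origin even when that would downgrade a package.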
Update the APT cache with apt update.
sudo apt update
Install the CUDA Toolkit package:
sudo apt install cuda-toolkit
Remove any existing GPU hardware settings by deleting the file /etc/modprobe.d/nvidia.conf.
sudo rm /etc/modprobe.d/nvidia.conf
Install the NVIDIA driver and modprobe packages:
sudo apt install nvidia-dkms-570-open nvidia-driver-570-open nvidia-modprobe
Install IMEX.
sudo apt-get install nvidia-imex
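A quick hedged sanity check at this point is to confirm that the driver utilities landed on the PATH; note that nvidia-smi only reports GPUs once the new modules are loaded after the reboot later in this procedure:

```shell
# Hedged check: verify the driver utilities are installed on the PATH.
# GPU status itself is only available after the new modules load.
if command -v nvidia-smi >/dev/null 2>&1; then
    NVSMI=installed
else
    NVSMI=missing
fi
echo "nvidia-smi: $NVSMI"
```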
Install DOCA Host#
The DOCA Host package provides the OS drivers for the BlueField-3 and ConnectX adapters.
Remove older versions of DOCA and adapter drivers.
Note
If no drivers were previously installed, the ofed_uninstall.sh script won’t be found.
$ for f in $( dpkg --list | grep -E 'doca|flexio|dpa-gdbserver|dpa-stats|dpaeumgmt' | awk '{print $2}' ); do echo $f ; apt remove --purge $f -y ; done
$ /usr/sbin/ofed_uninstall.sh --force
$ sudo apt-get autoremove
Install the DOCA host Debian package and update APT.
$ sudo dpkg -i doca-host-2.10.0-093509-25.01-ubuntu2404
$ sudo apt-get update
Install the doca-extra package to enable NVIDIA kernel support.
sudo apt install -y doca-extra
Rebuild the kernel headers with doca-kernel-support.
sudo /opt/mellanox/doca/tools/doca-kernel-support
The doca-kernel-support command generates a new Debian package file. Install this file using dpkg -i <package_file>.
For example:
$ sudo /opt/mellanox/doca/tools/doca-kernel-support
doca-kernel-support: Built single package: /tmp/DOCA.EuUfkWfV7Z/doca-kernel-repo-2.9.0-1.kver.5.14.0.356.el9.arm.deb
doca-kernel-support: Done
$ sudo dpkg -i /tmp/DOCA.EuUfkWfV7Z/doca-kernel-repo-2.9.0-1.kver.5.14.0.356.el9.arm.deb
Update APT and install the doca-all host profile.
sudo apt update
sudo apt -y install doca-all
Note
doca-kernel-support doesn’t support customized or unofficial kernels. Refer to Installing Software on Host for more information about the installation.
Configure the NVIDIA Software Packages#
This section details the configuration options for the GPU driver and IMEX application.
Enable Profiling#
Enable profiling for all users by applying the settings to /etc/modprobe.d/nvprofiling.conf.
echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvprofiling.conf
Enable the IMEX Control Channel#
Configure IMEX to automatically enable the IMEX control channel by configuring /etc/modprobe.d/nvidia.conf.
$ cat /etc/modprobe.d/nvidia.conf
options nvidia NVreg_CreateImexChannel0=1
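After the reboot at the end of this procedure, the control channel should be visible as a device node. This hedged check assumes the /dev/nvidia-caps-imex-channels/ path described in the IMEX documentation; confirm the path against your driver release:

```shell
# Hedged check (device path assumed from the IMEX documentation): with
# NVreg_CreateImexChannel0=1 applied and the driver loaded, channel 0
# should appear as a device node.
if [ -e /dev/nvidia-caps-imex-channels/channel0 ]; then
    IMEX_CH0=present
else
    IMEX_CH0=absent
fi
echo "IMEX channel 0: $IMEX_CH0"
```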
Note
Multi-tenant MNNVL systems require a more complex IMEX configuration that is beyond the scope of this document.
Refer to the IMEX documentation for more information.
Configure IMEX Peers#
Create the file /etc/nvidia-imex/nodes_config.cfg with the management Ethernet interface IP addresses of all nodes in the cluster.
Here is an example /etc/nvidia-imex/nodes_config.cfg file:
Note
Place one IP address per line, with no other information.
10.114.228.6
10.114.228.7
10.114.228.8
10.114.228.9
10.114.228.10
10.114.228.11
10.114.228.12
10.114.228.13
10.114.228.14
Note
IMEX can use any compute tray IP address, as long as there is connectivity between all IP addresses in the configuration.
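Because the file format is strict (one IP address per line, nothing else), a simple grep can catch malformed entries before the IMEX service starts. A hedged sketch, demonstrated on a scratch copy of the example addresses:

```shell
# Hedged sketch: every line of a nodes_config.cfg must be exactly one
# IPv4 address; count the lines that are not. Demonstrated on a scratch
# copy rather than the live /etc/nvidia-imex/nodes_config.cfg.
CFG=$(mktemp)
printf '%s\n' 10.114.228.6 10.114.228.7 10.114.228.8 > "$CFG"
BAD=$(grep -Evc '^([0-9]{1,3}\.){3}[0-9]{1,3}$' "$CFG") || true
echo "malformed lines: $BAD"
```

Run the same grep against /etc/nvidia-imex/nodes_config.cfg on each node; a nonzero count means a stray comment, hostname, or blank line crept in.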
Enable the NVIDIA Persistence Daemon#
Configure the NVIDIA persistence daemon by editing the /etc/systemd/system/nvidia-persistenced.service file.
$ cat /etc/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target
[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
Enable and start the daemon:
sudo systemctl enable nvidia-persistenced.service
Enable the IMEX Daemon#
Enable the IMEX service to start at boot:
sudo systemctl enable nvidia-imex.service
Rebuild the initramfs#
After changing the IMEX configuration, rebuild the initramfs.
sudo update-initramfs -u -k all
Reboot the Compute Tray#
Reboot the compute tray to complete the driver and application configurations.
sudo reboot
Switch Tray Software#
Each NVLink switch tray runs a local instance of the NVOS software. The NVLink switches ship from the factory with NVOS preinstalled.
NVOS contains all the required software and firmware components.
Tip
Refer to your manufacturer for the latest version of NVOS and upgrade instructions.
To show the firmware information for the platform firmware components, run the following command.
$ nv show platform firmware
Here is an example output:
The actual firmware version on your system might be different.
Figure 2 nv show platform firmware output#
Enable the NVLink Cluster#
The NVIDIA Management Controller (NMX-C) service runs on a single NVLink switch. NMX-C manages the network configuration for all NVLink switches.
On a compute or management node, download and install grpcurl.
Using grpcurl, create a gRPC connection, replacing 127.0.0.1 with the NVLink switch management IP address.
grpcurl -plaintext -d \
"{ \"gatewayId\": \"nvidia-bringup\", \
\"major_version\": \"PROTO_MSG_MAJOR_VERSION\", \
\"minor_version\": \"PROTO_MSG_MINOR_VERSION\" }" \
127.0.0.1:9371 nmx_c.NMX_Controller.Hello
After creating the gRPC connection, configure the FM topology, again replacing 127.0.0.1 with the NVLink switch management IP address.
grpcurl -plaintext -d \
"{ \"gatewayId\": \"nvidia-bringup\", \
\"staticConfig\": { \"configKeyVals\": \
{ \"configKeyVal\": [ { \"configFileName\": \
\"fm_config\", \"key\": \"MNNVL_TOPOLOGY\", \
\"value\": \"gb200_nvl72r1_c2g4_topology\"}]}}}" \
127.0.0.1:9371 nmx_c.NMX_Controller.SetStaticConfig
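An alternative to the escaped inline JSON is to keep the request body in a file; grpcurl accepts -d @ to read the body from stdin. A hedged sketch (the grpcurl invocation is shown as a comment since it needs a reachable switch):

```shell
# Hedged sketch: stage the SetStaticConfig request body in a file so the
# JSON needs no shell escaping.
PAYLOAD=$(mktemp)
cat > "$PAYLOAD" <<'EOF'
{"gatewayId": "nvidia-bringup",
 "staticConfig": {"configKeyVals": {"configKeyVal": [
   {"configFileName": "fm_config",
    "key": "MNNVL_TOPOLOGY",
    "value": "gb200_nvl72r1_c2g4_topology"}]}}}
EOF
# With a reachable switch (replace 127.0.0.1 with its management IP):
# grpcurl -plaintext -d @ 127.0.0.1:9371 nmx_c.NMX_Controller.SetStaticConfig < "$PAYLOAD"
echo "payload staged: $PAYLOAD"
```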
Note
If the gRPC option isn’t possible, use the NVOS CLI to generate an FM configuration file.
nv action generate sdn config app nmx-controller type fm_config
Download the generated file and edit the MNNVL_TOPOLOGY setting to MNNVL_TOPOLOGY=gb200_nvl72r1_c2g4_topology.
Upload the configuration file back to the switch and install it with the following command:
nv action install sdn config app nmx-controller type fm_config files <FILENAME>
After configuring FM, enable the cluster and save the configuration.
nv set cluster state enabled
nv config apply
nv config save
Start the NMX-C application.
nv action start cluster app nmx-controller
Verify the cluster and NMX-C application are running.
nv show cluster apps running