# NMC Prerequisites
Before installing NVIDIA Mission Control software components, ensure the following prerequisites are in place.
## Network Planning
The NMC software stack requires several Virtual IP (VIP) addresses.
In each case, the VIP must be allocated from the same subnet as the corresponding interface (inband or out-of-band) on the nodes that will hold it.
The BCM HA VIPs are held by the active BCM head node; the remaining VIPs are implemented via MetalLB on the k8s-system-admin nodes.
In this context, inband refers to the internalnet collection of subnets used for cluster-internal traffic.
Coordinate with your network administrator to reserve these addresses before beginning installation.
| Component | VIP | Purpose |
|---|---|---|
| BCM | Inband HA VIP | HA VIP on the BCM head node inband subnet, held by the active BCM head node |
| BCM | Out-of-band HA VIP | HA VIP on the BCM head node out-of-band subnet, held by the active BCM head node |
| Run:ai | Control-plane VIP (inband) | Single entry point for the Run:ai UI, API, and workspaces |
| Run:ai | Inference VIP (inband) | Entry point for inference workloads (NIMs, customer inference services) |
| Autonomous Hardware Recovery (AHR) | Control-plane VIP (inband) | Access to the AHR web UI, APIs, and runbook engine |
| NetQ | Control-plane VIP (inband) | Ingress point for the NetQ UI and APIs |
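The non-BCM VIPs above are announced by MetalLB from the k8s-system-admin nodes. Purely as an illustration of how a reserved VIP maps onto a MetalLB address pool (the pool names and the `10.141.100.10` address below are hypothetical placeholders, and the NMC installation normally provisions these resources for you), such a configuration might look like:

```yaml
# Hypothetical sketch only -- names and addresses are placeholders, not NMC defaults.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: runai-control-plane        # hypothetical pool name
  namespace: metallb-system
spec:
  addresses:
    - 10.141.100.10/32             # the reserved inband VIP, expressed as a /32
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: runai-control-plane
  namespace: metallb-system
spec:
  ipAddressPools:
    - runai-control-plane
```

Because each VIP is a single address, a `/32` pool pins MetalLB to exactly the address your network administrator reserved.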
## DGX-OS Image Configuration
The default DGX-OS image, installed by the BCM 11 installer or through the cm-create-image command, contains all the software required to operate a DGX GB200 NVL72 system.
However, the default configuration might not be optimized for your specific cluster architecture.
This section describes additional modifications you can apply to the OS image to better support the rest of the software stack.
Use the following steps to further configure your DGX-OS image for use in a GB200 NVL72 cluster:
1. Locate your DGX-OS image and enter the chroot:

```shell
cm-chroot /cm/images/dgxos-7.2-image
```

2. Once inside the chrooted image, run the following command to create the post-install script within the chroot:

```shell
cat > /tmp/dgx-post-install.sh <<'EOF'
#!/bin/bash
set -e
set -x

setup_cleanup() {
    apt update && apt install -y python3-xmltodict
    systemctl disable nvidia-fabricmanager  # RC10 issue
    rm -f /etc/apt/sources.list.d/doca.list
}

setup_dcgm() {
    systemctl enable nvidia-dcgm-exporter
    echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /etc/dcgm-exporter/default-counters.csv
}

setup_ib() {
    systemctl enable openibd
    rm -f /etc/libibverbs.d/vmw_pvrdma.driver
}

setup_slurm() {
    # These modifications are documented in the NMC guide for installing Slurm.
    # If the image isn't actually a Slurm worker image, this does no harm.
    cat > /etc/sysconfig/slurmd <<EOT
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
EOT
}

setup_tuning() {
    cat > /etc/sysctl.d/99-sysctl.conf <<EOT
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.enP6p3s0f0np0.rp_filter=1
net.ipv4.conf.lo.rp_filter=1
EOT
}

disable_nvsm_efi_check() {
    sed -i '/Status of volumes/s/^[[:space:]]*#//' /cm/local/apps/cmd/scripts/healthchecks/configfiles/nvsm_show_health.py
}

setup_cleanup
setup_ib
setup_dcgm
setup_slurm
setup_tuning
disable_nvsm_efi_check
EOF
```

3. Make the script executable:

```shell
chmod +x /tmp/dgx-post-install.sh
```

4. Run the script:

```shell
/tmp/dgx-post-install.sh
```

5. Exit the chroot:

```shell
exit
```
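Because the post-install script is generated from a long heredoc, a typo such as a malformed function definition only surfaces when the script runs inside the image. A parse-only pass with `bash -n` catches syntax errors before anything executes. A minimal self-contained sketch of the idea, using a stand-in script rather than the real post-install script:

```shell
# Sketch: write a generated script, then validate it with a parse-only pass
# before executing it. The script body here is a stand-in, not the real one.
cat > /tmp/sample-post-install.sh <<'EOF'
#!/bin/bash
set -e
noop() {
    echo "ok"
}
noop
EOF

# bash -n parses the file without executing it; it exits non-zero on syntax errors
if bash -n /tmp/sample-post-install.sh; then
    echo "syntax OK"
fi
```

Running the same check against `/tmp/dgx-post-install.sh` inside the chroot, between steps 3 and 4, costs nothing and avoids a half-applied image when a heredoc edit goes wrong.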