NMC Prerequisites#

Before installing NVIDIA Mission Control software components, ensure the following prerequisites are in place.

Network Planning#

The NMC software stack requires several Virtual IP (VIP) addresses. In each case, the VIP must be allocated from the same subnet as the corresponding interface (inband or out-of-band) on the nodes that will hold it. The BCM HA VIPs are held by the active BCM head node; the remaining VIPs are implemented via MetalLB on the k8s-system-admin nodes. In this context, inband refers to the internalnet collection of subnets used for cluster-internal traffic. Coordinate with your network administrator to reserve these addresses before beginning installation.

Table 3 Required VIPs#

| Component | VIP | Purpose |
| --- | --- | --- |
| BCM | Inband HA VIP | HA VIP on the BCM head node inband subnet, held by the active BCM head node |
| BCM | Out-of-band HA VIP | HA VIP on the BCM head node out-of-band subnet, held by the active BCM head node |
| Run:ai | Control-plane VIP (inband) | Single entry point for the Run:ai UI, API, and workspaces |
| Run:ai | Inference VIP (inband) | Entry point for inference workloads (NIMs, customer inference services) |
| Autonomous Recovery Engine (AHR) | Control-plane VIP (inband) | Access to the AHR web UI, APIs, and runbook engine |
| NetQ | Control-plane VIP (inband) | Ingress point for the NetQ UI and APIs |
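When reserving these addresses, it can help to sanity-check that each candidate VIP actually falls inside the subnet of the interface that will hold it. The following sketch is one way to do that in shell; the function names are illustrative (not part of NMC or BCM tooling), the check is IPv4-only, and the addresses shown are placeholders:

```shell
#!/bin/bash
# Hypothetical VIP planning check (IPv4 only); helper names are
# illustrative, not part of NMC or BCM tooling.

# Convert a dotted-quad address to a 32-bit integer.
ip2int() {
  local IFS=.
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# in_subnet VIP NETWORK/PREFIX -> exit 0 if VIP lies inside the subnet
in_subnet() {
  local vip=$1 net=${2%/*} prefix=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( $(ip2int "$vip") & mask )) -eq $(( $(ip2int "$net") & mask )) ]
}

# Example: check a candidate inband VIP against an internalnet subnet
# (placeholder addresses; substitute your own allocations).
in_subnet 10.141.255.254 10.141.0.0/16 && echo "VIP is inside internalnet"
```

Running the same check for each row of the table above, against the matching inband or out-of-band subnet, catches mis-reserved addresses before installation begins.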

DGX-OS Image Configuration#

The default DGX-OS image, installed by the BCM 11 installer or through the cm-create-image command, contains all the software required to operate a DGX GB200 NVL72 system. However, the default configuration might not be optimized for your specific cluster architecture.

This section describes additional modifications you can apply to the OS image to better support the rest of the software stack.

Use the following steps to further configure your DGX-OS image for use in a GB200 NVL72 cluster:

  1. Locate your DGX-OS image and enter the chroot:

    cm-chroot /cm/images/dgxos-7.2-image
    
  2. Once inside the chrooted image, run the following command to create the post-install script:

    cat > /tmp/dgx-post-install.sh <<'EOF'
    #!/bin/bash
    
    set -e
    set -x
    
    
    setup_cleanup() {
      apt update && apt install -y python3-xmltodict
      systemctl disable nvidia-fabricmanager # RC10 issue
      rm -f /etc/apt/sources.list.d/doca.list
    }
    
    setup_dcgm() {
      systemctl enable nvidia-dcgm-exporter
      echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /etc/dcgm-exporter/default-counters.csv
    }
    
    setup_ib() {
      systemctl enable openibd
      rm -f /etc/libibverbs.d/vmw_pvrdma.driver
    }
    
    setup_slurm() {
      # These modifications are documented in the NMC guide for installing Slurm.
      # If the image is not a Slurm worker image, these settings are harmless.
      cat > /etc/sysconfig/slurmd <<EOT
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
    EOT
    }
    
    setup_tuning() {
      cat > /etc/sysctl.d/99-sysctl.conf <<EOT
    net.ipv4.conf.all.rp_filter=1
    net.ipv4.conf.default.rp_filter=1
    net.ipv4.conf.enP6p3s0f0np0.rp_filter=1
    net.ipv4.conf.lo.rp_filter=1
    EOT
    }
    
    disable_nvsm_efi_check() {
      sed -i '/Status of volumes/s/^[[:space:]]*#//' /cm/local/apps/cmd/scripts/healthchecks/configfiles/nvsm_show_health.py
    }
    
    setup_cleanup
    setup_ib
    setup_dcgm
    setup_slurm
    setup_tuning
    disable_nvsm_efi_check
    
    EOF
    
  3. Make the script executable:

    chmod +x /tmp/dgx-post-install.sh
    
  4. Run the script:

    /tmp/dgx-post-install.sh
    
  5. Exit the chroot using the following command:

    exit
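After exiting the chroot, you can spot-check the image tree from the head node to confirm the script's changes landed. This is a hypothetical verification sketch, not part of the official procedure; the file list mirrors the post-install script above, and the image path is the same example path used in step 1:

```shell
#!/bin/bash
# Hypothetical post-install verification: checks a few files that
# dgx-post-install.sh is expected to have written or modified.

# check_image IMAGE_ROOT -> prints "ok"/"MISSING" per expected file:pattern
check_image() {
  local img=$1 spec file pattern
  for spec in \
      '/etc/sysconfig/slurmd:PMIX_MCA_psec=none' \
      '/etc/sysctl.d/99-sysctl.conf:rp_filter=1' \
      '/etc/dcgm-exporter/default-counters.csv:DCGM_EXP_XID_ERRORS_COUNT'
  do
    file=${spec%%:*}
    pattern=${spec#*:}
    # -s suppresses errors if the file does not exist in the image tree.
    if grep -qs "$pattern" "$img$file"; then
      echo "ok: $file"
    else
      echo "MISSING: $file ($pattern)"
    fi
  done
}

# Example (image path is an assumption; adjust to your image name):
# check_image /cm/images/dgxos-7.2-image
```

Any `MISSING` line suggests the corresponding step of the script did not run to completion and the chroot procedure should be repeated before deploying the image.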