Post-Installation Steps for DGX-OS

The default DGX-OS image, installed by the BCM 11 installer or through the cm-create-image command, contains all the software required to operate a DGX GB200 NVL72 system. However, the default configuration might not be optimized for your specific cluster architecture.

This section describes additional modifications you can apply to the OS image to better support the rest of the software stack.

Use the following steps to further configure your DGX-OS image for use in a GB200 NVL72 cluster:

  1. Locate your DGX-OS image and enter the chroot:

    cm-chroot /cm/images/dgxos-7.2-image
    
  2. Inside the chroot, run the following command to create the post-installation script:

    cat > /tmp/dgx-post-install.sh <<'EOF'
    #!/bin/bash
    
    set -e
    set -x
    
    
    setup_cleanup() {
      apt install -y python3-xmltodict
      systemctl disable nvidia-fabricmanager # RC10 issue
      rm -f /etc/apt/sources.list.d/doca.list
    }
    
    setup_dcgm() {
      systemctl enable nvidia-dcgm-exporter
      echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /etc/dcgm-exporter/default-counters.csv
    }
    
    setup_ib() {
      systemctl enable openibd
      rm -f /etc/libibverbs.d/vmw_pvrdma.driver
    }
    
    setup_slurm() {
      # These modifications are documented in the NMC guide for installing Slurm.
      # They are harmless if the image is not a Slurm worker image.
      cat > /etc/sysconfig/slurmd <<EOT
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
    EOT
      cat > /etc/enroot/mounts.d/30-imex.fstab <<EOT
    /dev/nvidia-caps-imex-channels
    EOT
    }
    
    setup_tuning() {
      cat > /etc/sysctl.d/99-sysctl.conf <<EOT
    net.ipv4.conf.all.rp_filter=1
    net.ipv4.conf.default.rp_filter=1
    net.ipv4.conf.enP6p3s0f0np0.rp_filter=1
    net.ipv4.conf.lo.rp_filter=1
    EOT
    }
    
    setup_cleanup
    setup_ib
    setup_dcgm
    setup_slurm
    setup_tuning
    
    EOF
    
  3. Make the script executable:

    chmod +x /tmp/dgx-post-install.sh
    
  4. Run the script:

    /tmp/dgx-post-install.sh
    
  5. Exit the chroot:

    exit
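
After exiting the chroot, you may want to confirm that the script left the image in the expected state. The following is a minimal sketch, not part of BCM or DGX-OS: `check_image` is a hypothetical helper name, and it inspects only the files that the script above writes or removes. It does not check whether the systemd units were enabled or disabled.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not shipped with BCM or DGX-OS).
# Usage: check_image IMAGE_ROOT
# Returns 0 if every file check passes, non-zero otherwise.
check_image() {
  local root="$1"
  local failures=0
  local spec file text

  # Files the post-install script writes, each paired with one line it
  # should contain. Format: "path|expected text".
  for spec in \
    "/etc/sysconfig/slurmd|PMIX_MCA_gds=hash" \
    "/etc/enroot/mounts.d/30-imex.fstab|/dev/nvidia-caps-imex-channels" \
    "/etc/sysctl.d/99-sysctl.conf|net.ipv4.conf.all.rp_filter=1" \
    "/etc/dcgm-exporter/default-counters.csv|DCGM_EXP_XID_ERRORS_COUNT"
  do
    file="${spec%%|*}"   # text before the first '|'
    text="${spec#*|}"    # text after the first '|'
    # grep -F treats the expected text as a fixed string, not a regex.
    if ! grep -qF -- "$text" "$root$file" 2>/dev/null; then
      echo "MISSING: '$text' in $file"
      failures=$((failures + 1))
    fi
  done

  # Files the post-install script removes.
  for file in \
    /etc/apt/sources.list.d/doca.list \
    /etc/libibverbs.d/vmw_pvrdma.driver
  do
    if [ -e "$root$file" ]; then
      echo "STILL PRESENT: $file"
      failures=$((failures + 1))
    fi
  done

  [ "$failures" -eq 0 ]
}
```

To use it, define the function in a shell outside the chroot and pass the image path from step 1, for example `check_image /cm/images/dgxos-7.2-image && echo "image looks OK"`. Taking the image root as an argument means the same check also works for any other image directory.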