Post-Installation Steps for DGX-OS
The default DGX-OS image, installed by the BCM 11 installer or through the cm-create-image command, contains all the software required to operate a DGX GB200 NVL72 system.
However, the default configuration might not be optimized for your specific cluster architecture.
This section describes additional modifications you can apply to the OS image to better support the rest of the software stack.
Use the following steps to further configure your DGX-OS image for use in a GB200 NVL72 cluster:
Locate your DGX-OS image and enter the chroot:
cm-chroot /cm/images/dgxos-7.2-image
Once inside the chrooted image, run this command to create a script within the chroot:
cat > /tmp/dgx-post-install.sh <<EOF
#!/bin/bash
set -e
set -x

setup_cleanup() {
    apt install -y python3-xmltodict
    systemctl disable nvidia-fabricmanager
    # RC10 issue
    rm -f /etc/apt/sources.list.d/doca.list
}

setup_dcgm() {
    systemctl enable nvidia-dcgm-exporter
    echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /etc/dcgm-exporter/default-counters.csv
}

setup_ib() {
    systemctl enable openibd
    rm -f /etc/libibverbs.d/vmw_pvrdma.driver
}

setup_slurm() {
    # These modifications are documented in the NMC guide for installing slurm.
    # if the image isn't actually a slurm worker image, this does not hurt.
    cat > /etc/sysconfig/slurmd <<EOT
PMIX_MCA_ptl=^usock
PMIX_MCA_psec=none
PMIX_SYSTEM_TMPDIR=/var/empty
PMIX_MCA_gds=hash
EOT
    cat > /etc/enroot/mounts.d/30-imex.fstab <<EOT
/dev/nvidia-caps-imex-channels
EOT
}

setup_tuning() {
    cat > /etc/sysctl.d/99-sysctl.conf <<EOT
net.ipv4.conf.all.rp_filter=1
net.ipv4.conf.default.rp_filter=1
net.ipv4.conf.enP6p3s0f0np0.rp_filter=1
net.ipv4.conf.lo.rp_filter=1
EOT
}

setup_cleanup
setup_ib
setup_dcgm
setup_slurm
setup_tuning
EOF
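Because setup_dcgm appends to default-counters.csv with >>, running the script a second time adds the same counter line again. One way to guard against that is an append-if-missing helper. The following is only a sketch: append_once is a hypothetical name, not part of the DGX-OS or BCM tooling, and this section does not say whether a duplicate counter line causes problems for dcgm-exporter.

```shell
#!/bin/sh
# append_once LINE FILE
# Hypothetical helper: append LINE to FILE only when an identical
# line is not already present. Creates FILE if it does not exist.
append_once() {
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    if ! grep -qxF -- "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}
```

If a helper like this is placed inside the script created above, note that the outer here-document uses an unquoted delimiter (<<EOF), so "$1" and "$2" would be expanded while the script file is being written. Quoting the delimiter (<<'EOF') or escaping each $ avoids that.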
Make the script executable using the following command:
chmod +x /tmp/dgx-post-install.sh
Run the script using the following command:
/tmp/dgx-post-install.sh
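Before leaving the chroot, it can be worth confirming that the script left the image in the expected state. The following is a minimal sketch that checks only the files the script writes or removes. check_post_install is a hypothetical helper, and its optional root-prefix argument is an assumption added so that the same check can be pointed at an image directory from outside the chroot.

```shell
#!/bin/sh
# check_post_install [ROOT]
# Hypothetical helper: verify the file-level results of dgx-post-install.sh.
# ROOT is an optional path prefix; leave it empty when run inside the chroot.
check_post_install() {
    root="${1:-}"
    rc=0

    # Files the script is expected to create (must exist and be non-empty).
    for f in \
        /etc/sysconfig/slurmd \
        /etc/enroot/mounts.d/30-imex.fstab \
        /etc/sysctl.d/99-sysctl.conf
    do
        if [ ! -s "${root}${f}" ]; then
            echo "missing or empty: ${f}"
            rc=1
        fi
    done

    # Files the script is expected to remove.
    for f in \
        /etc/apt/sources.list.d/doca.list \
        /etc/libibverbs.d/vmw_pvrdma.driver
    do
        if [ -e "${root}${f}" ]; then
            echo "still present: ${f}"
            rc=1
        fi
    done

    return "$rc"
}
```

Inside the chroot, call it with no argument, for example check_post_install && echo "image looks OK". This sketch does not inspect the enabled or disabled state of the systemd units or whether python3-xmltodict was installed.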
Exit the chroot using the following command:
exit