Post-Installation Steps for DGX-OS

Note

Consider changing the logrotate policy from weekly to daily or hourly to prevent the /var filesystem from filling up, which can cause various services to crash.

Run the following commands on the BCM head node. If High Availability (HA) is configured, run these commands on both the active and standby head nodes:

# Rotate rsyslog logs daily (use the commented line for hourly instead)
sudo sed -i 's/weekly/daily/' /etc/logrotate.d/rsyslog
# sudo sed -i 's/weekly/hourly/' /etc/logrotate.d/rsyslog

# Run the logrotate timer daily (use the commented line for hourly instead)
sudo sed -i 's/^OnCalendar=.*/OnCalendar=daily/' /usr/lib/systemd/system/logrotate.timer
# sudo sed -i 's/^OnCalendar=.*/OnCalendar=hourly/' /usr/lib/systemd/system/logrotate.timer

sudo systemctl daemon-reload
sudo systemctl restart logrotate.timer
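
To see what the `sed` substitution will change before editing a file in place, the same expression can be run without `-i` so that the result goes to standard output. The following is a minimal sketch: `preview_rotation` is a hypothetical helper name, and the sample file contents are an assumption for illustration, not a copy of the actual `/etc/logrotate.d/rsyslog` on the head node.

```shell
#!/bin/bash
# preview_rotation is a hypothetical helper (not part of BCM or logrotate).
# It prints the file with the rotation interval replaced, leaving the file untouched.
preview_rotation() {
  local src="$1"
  local interval="${2:-daily}"
  sed "s/weekly/${interval}/" "$src"
}

# Assumed sample config used only for this demonstration.
sample="$(mktemp)"
printf '%s\n' '/var/log/syslog' '{' '    rotate 4' '    weekly' '    compress' '}' > "$sample"

# Show the result of switching the sample to daily rotation.
preview_rotation "$sample" daily
```

On a head node, pass `/etc/logrotate.d/rsyslog` as the first argument. After applying the change, `systemctl list-timers logrotate.timer` lists the next scheduled run, and `logrotate -d /etc/logrotate.d/rsyslog` performs a dry run without rotating anything.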

The default DGX-OS image, installed by the BCM 11 installer or through the cm-create-image command, contains all the software required to operate a DGX GB200 NVL72 system. However, the default configuration might not be optimized for your specific cluster architecture.

This section describes additional modifications you can apply to the OS image to better support the rest of the software stack.

Use the following steps to further configure your DGX-OS image for use in a GB200 NVL72 cluster:

  1. Locate your DGX-OS image and enter the chroot:

    cm-chroot /cm/images/dgxos-7.2-image
    
  2. Inside the chroot, run the following command to create the post-install script:

    cat > /tmp/dgx-post-install.sh <<'EOF'
    #!/bin/bash
    
    set -e
    set -x
    
    
    setup_cleanup() {
      apt update && apt install -y python3-xmltodict
      systemctl disable nvidia-fabricmanager # RC10 issue
      rm -f /etc/apt/sources.list.d/doca.list
    }
    
    setup_dcgm() {
      systemctl enable nvidia-dcgm-exporter
      echo "DCGM_EXP_XID_ERRORS_COUNT, gauge, Count of XID Errors within user-specified time window (see xid-count-window-size param)." >> /etc/dcgm-exporter/default-counters.csv
    }
    
    setup_ib() {
      systemctl enable openibd
      rm -f /etc/libibverbs.d/vmw_pvrdma.driver
    }
    
    setup_slurm() {
      # These modifications are documented in the NMC guide for installing Slurm.
      # If the image is not actually a Slurm worker image, they are harmless.
      cat > /etc/sysconfig/slurmd <<EOT
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
    EOT
    }
    
    setup_tuning() {
      cat > /etc/sysctl.d/99-sysctl.conf  <<EOT
    net.ipv4.conf.all.rp_filter=1
    net.ipv4.conf.default.rp_filter=1
    net.ipv4.conf.enP6p3s0f0np0.rp_filter=1
    net.ipv4.conf.lo.rp_filter=1
    EOT
    }
    
    disable_nvsm_efi_check() {
      sed -i '/Status of volumes/s/^[[:space:]]*#//' /cm/local/apps/cmd/scripts/healthchecks/configfiles/nvsm_show_health.py
    }
    
    setup_cleanup
    setup_ib
    setup_dcgm
    setup_slurm
    setup_tuning
    disable_nvsm_efi_check
    
    EOF
    
  3. Make the script executable:

    chmod +x /tmp/dgx-post-install.sh
    
  4. Run the script:

    /tmp/dgx-post-install.sh
    
  5. Exit the chroot:

    exit
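
Between steps 3 and 4, it can be worth confirming that the generated script parses cleanly. `bash -n` reads a script and reports syntax errors without executing any of its commands. The following is a minimal sketch: `check_script` is a hypothetical helper name, and the two sample scripts are made up for illustration.

```shell
#!/bin/bash
# check_script is a hypothetical helper: it parses a script without running it
# and returns a non-zero status if bash finds a syntax error.
check_script() {
  bash -n "$1" 2>/dev/null
}

good="$(mktemp)"
bad="$(mktemp)"

# A well-formed function definition.
printf '%s\n' '#!/bin/bash' 'setup_ib() {' '  echo enabling openibd' '}' 'setup_ib' > "$good"
# The same script with the parentheses missing from the function definition.
printf '%s\n' '#!/bin/bash' 'setup_ib {' '  echo enabling openibd' '}' 'setup_ib' > "$bad"

check_script "$good" && echo "good script: no syntax errors found"
check_script "$bad" || echo "bad script: syntax error detected"
```

Inside the chroot, the equivalent check would be `bash -n /tmp/dgx-post-install.sh`. A zero exit status means no syntax errors were found. It does not guarantee that the commands in the script will succeed.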