Software Update

  1. Clone the category in BCM

    cmsh
    category
    clone <dgx-gb200> <new-dgx-gb200>
    commit
    
  2. Clone the OS image

    cmsh
    softwareimage
    clone <dgx-image> <new-dgx-image>
    commit
    

    Point the new category at the new image:

    cmsh
    category
    set <new-dgx-gb200> softwareimage <new-dgx-image>
    commit
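
The cmsh sessions in steps 1 and 2 can also be scripted. The sketch below assumes that `cmsh -c` accepts a semicolon-separated command string; the helper names (`cmsh_cmd`, `run_cmsh`), the `DRY_RUN` variable, and the category and image names are illustrative placeholders, not part of BCM. By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: build the cmsh command strings from steps 1 and 2 for `cmsh -c`.
# cmsh_cmd, run_cmsh and DRY_RUN are illustrative names, not part of BCM.

# Join a cmsh mode and one command into a "mode; command; commit" string.
cmsh_cmd() {
    mode="$1"; shift
    printf '%s; %s; commit\n' "$mode" "$*"
}

# Print the command when DRY_RUN=1 (the default); otherwise hand it to cmsh.
run_cmsh() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf 'cmsh -c "%s"\n' "$1"
    else
        cmsh -c "$1"
    fi
}

# Placeholder names; replace them with the real category and image names.
OLD_CAT="dgx-gb200"; NEW_CAT="new-dgx-gb200"
OLD_IMG="dgx-image"; NEW_IMG="new-dgx-image"

run_cmsh "$(cmsh_cmd category clone "$OLD_CAT" "$NEW_CAT")"
run_cmsh "$(cmsh_cmd softwareimage clone "$OLD_IMG" "$NEW_IMG")"
run_cmsh "$(cmsh_cmd category set "$NEW_CAT" softwareimage "$NEW_IMG")"
```

Running the script with `DRY_RUN=0` would execute the three commands in the same order as the interactive sessions.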
    
  3. Enter the image (chroot) to make changes

    cm-chroot /cm/images/<new-dgx-image>/
    
  4. Create the DOCA repository file for the image architecture

    x86_64:

    dd status=none of=/etc/apt/sources.list.d/doca.sources << EOF
    Types: deb
    URIs: https://linux.mellanox.com/public/repo/doca/baseos8-latest/ubuntu24.04/x86_64/
    Suites: /
    Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg
    EOF
    

    arm64:

    dd status=none of=/etc/apt/sources.list.d/doca.sources << EOF
    Types: deb
    URIs: https://linux.mellanox.com/public/repo/doca/baseos8-latest/ubuntu24.04/arm64-sbsa/
    Suites: /
    Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg
    EOF
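
The two variants in step 4 differ only in the last URL component, so the choice can be derived from the image's package architecture. This is a minimal sketch, assuming `dpkg --print-architecture` inside the chroot reports `amd64` or `arm64`; `doca_repo_url` and `write_doca_sources` are illustrative helper names.

```shell
#!/bin/sh
# Sketch: choose the DOCA repository URL from the package architecture.
# doca_repo_url and write_doca_sources are illustrative helper names.

DOCA_BASE="https://linux.mellanox.com/public/repo/doca/baseos8-latest/ubuntu24.04"

# Map an architecture name to the repository URL used in step 4.
doca_repo_url() {
    case "$1" in
        amd64|x86_64)  printf '%s/x86_64/\n' "$DOCA_BASE" ;;
        arm64|aarch64) printf '%s/arm64-sbsa/\n' "$DOCA_BASE" ;;
        *)             echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Write the sources file to $1 for architecture $2
# (default: the architecture reported by dpkg).
write_doca_sources() {
    url="$(doca_repo_url "${2:-$(dpkg --print-architecture)}")" || return 1
    printf 'Types: deb\nURIs: %s\nSuites: /\nSigned-By: %s\n' \
        "$url" /usr/share/keyrings/GPG-KEY-Mellanox.gpg > "$1"
}

# Inside the chroot, for example:
#   write_doca_sources /etc/apt/sources.list.d/doca.sources
```

The file it writes has the same four fields as the manual `dd` commands above.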
    
  5. Install the latest DGX OS packages

    Install drivers and software packages that are compatible with the new firmware.

    For the detailed DGX OS update procedure, see https://docs.nvidia.com/dgx/dgx-os-7-user-guide/upgrading-the-os.html#performing-package-upgrades-using-the-cli

    Also note the known-issue workaround described at https://docs.nvidia.com/dgx/dgx-os-7-user-guide/known_issues.html#dgx-gb200-system-failure-during-upgrade

    # 1. Update the internal database with the list of available packages and their versions
    apt update
    
    # 2. Review the packages that will be upgraded
    apt full-upgrade -s
    
    # 3. Upgrade to the latest version
    apt full-upgrade
    
    # 4. Re-run the DKMS build with --force against the newly installed kernel (see the review output from step 2)
    sudo dkms autoinstall --force -k <new-installed-kernel>
    
    # 5. Re-configure broken packages
    sudo apt -f install -y
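
Inside the chroot, `uname -r` reports the kernel of the running host rather than the kernel installed in the image, so the value for `-k` has to come from the image itself. One possible way to find it is sketched below. It assumes the highest version under `/lib/modules` is the kernel that was just installed, which may not hold if several kernel flavors are present; `newest_kernel` is an illustrative helper name.

```shell
#!/bin/sh
# Sketch: find the kernel version to pass to `dkms autoinstall -k`.
# newest_kernel is an illustrative helper; it assumes the highest version
# under the modules directory is the kernel that was just installed.

newest_kernel() {
    # $1: modules directory (default /lib/modules); prints the highest version.
    ls -1 "${1:-/lib/modules}" | sort -V | tail -n 1
}

# Inside the chroot, for example:
#   dkms autoinstall --force -k "$(newest_kernel)"
```

Check the printed version against the upgrade output before passing it to `dkms`.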
    

    Note

    This does not update the BCM Kernel in use.

    Install the MFT, DOCA, and NVIDIA driver packages:

    # Confirm that the DOCA sources file points to the external repository
    cat /etc/apt/sources.list.d/doca.sources
    
    # Expected output:
    # Types: deb
    # URIs: https://linux.mellanox.com/public/repo/doca/DGX_GBxx_latest_DOCA/ubuntu24.04/arm64-sbsa/
    # Suites: /
    # Signed-By: /usr/share/keyrings/GPG-KEY-Mellanox.gpg
    
    # Install DOCA package
    sudo apt-get update
    sudo apt install doca-all
    
    # Install driver package
    sudo dpkg -i nvidia-driver-local-repo-ubuntu2404-570.158.01_1.0-1_arm64.deb
    sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.158.01/nvidia-driver-local-5778B6CA-keyring.gpg /usr/share/keyrings/
    sudo mv /etc/apt/sources.list.d/cuda-compute-repo.sources /etc/apt/sources.list.d/cuda-compute-repo.sources.disabled
    sudo apt update
    sudo apt install nvidia-driver-570-open
    sudo apt-get install nvidia-imex-570
    sudo apt-get install nvidia-fabricmanager-570
    sudo apt-get install libnvidia-nscq-570
    

    Verify installations:

    # Check DOCA packages
    sudo dpkg -l | grep "<expected-DOCA-version>"
    
    # Check driver package
    sudo dpkg -l | grep "<expected-driver-version>"
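
The `grep` checks above rely on reading the output by eye. A stricter, scriptable variant is sketched here, assuming the installed package version string begins with the expected release number; `version_matches` and `check_pkg` are illustrative helper names.

```shell
#!/bin/sh
# Sketch: compare an installed package version with an expected prefix.
# version_matches and check_pkg are illustrative helper names.

# Succeed when version $1 starts with the expected prefix $2.
version_matches() {
    case "$1" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Look up package $1 with dpkg-query and compare it against prefix $2.
check_pkg() {
    installed="$(dpkg-query -W -f='${Version}' "$1" 2>/dev/null)" || {
        echo "$1: not installed"; return 1; }
    if version_matches "$installed" "$2"; then
        echo "$1: $installed OK"
    else
        echo "$1: $installed (expected $2*)"; return 1
    fi
}

# Example, using the driver version from the steps above:
#   check_pkg nvidia-driver-570-open 570.158.01
```

`check_pkg` returns a non-zero status on a mismatch, so it can gate later steps in a script.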
    
  6. Exit the chroot to save the changes into the image

    exit
    
  7. Assign the compute nodes to the new DGX category

    cmsh
    device
    foreach -n dgx-nodes[XX-XX] (set category <new-dgx-gb200>)
    commit
    
  8. Reboot compute nodes

    cmsh
    device
    reboot -c <new-dgx-gb200>
    
  9. Verify all components have been upgraded
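
Step 9 does not prescribe specific commands. One possible check, sketched below, compares the driver version that each rebooted node reports against the expected one. The input format (`<node> <version>` per line), the node names, and the collection loop over `ssh` and `nvidia-smi --query-gpu=driver_version --format=csv,noheader` are assumptions for illustration; `report_mismatches` is an illustrative helper name.

```shell
#!/bin/sh
# Sketch: list nodes whose reported driver version differs from the expected one.
# report_mismatches is an illustrative helper; input is "<node> <version>" lines.

report_mismatches() {
    # $1: expected version; reads stdin; prints mismatching nodes and
    # returns non-zero if any were found. The "" forces string comparison.
    awk -v want="$1" '
        ($2 "") != (want "") { print $1 ": " $2 " (expected " want ")"; bad = 1 }
        END                  { exit bad }
    '
}

# Example collection loop (node names are placeholders):
#   for n in dgx-node01 dgx-node02; do
#       v="$(ssh "$n" nvidia-smi --query-gpu=driver_version --format=csv,noheader | sort -u)"
#       echo "$n $v"
#   done | report_mismatches 570.158.01
```

The same pattern could be reused for other components, such as DOCA package versions, by changing the command that produces the version column.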