Upgrading

This section provides information about upgrading an existing DGX OS installation.

If you want to reimage the system with DGX OS to a default state, refer to Reimaging for more information.

Important

Before you upgrade a system or any installed software, always consult the Release Notes for the latest information about available upgrades. You can find out more about the release cadence and release methods for DGX OS in Release Guidance.

The types of upgrades differ as follows:

  • When you perform a release upgrade, you have DGX OS 4 installed and want to move to DGX OS 5.

    You can upgrade to DGX OS 5 only from the latest DGX OS 4.x release (for DGX Station, DGX-2, or DGX-1 systems) or from the latest DGX OS 4.99.x release (for DGX A100 systems). Refer to Performing a Release Upgrade from DGX OS 4 for the upgrade instructions. The instructions also provide information about completing an over-the-internet upgrade.

  • When you perform package upgrades, you want to install upgrades that have been made available in the repositories since the initial DGX OS 5 release. The repositories are periodically updated with packages that include bug fixes and security updates. The NVIDIA repository also includes packages with new features that are available with the latest DGX OS minor version release. Refer to Performing Package Upgrades for instructions.

Note

If you want to change the branch of a driver or CUDA Toolkit, refer to Additional Software for instructions.

Upgrades are cumulative, which means that your systems will install all available upgrades, including upgrades available from Ubuntu, such as the kernel. Performing upgrades will install the latest versions available at the time when the upgrade is performed. These may be newer than the current DGX OS release.

Important

The instructions in this chapter upgrade all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being upgraded, you can instruct the Ubuntu package manager to keep the current version.

For more information, refer to the Ubuntu Community Help Wiki: Introduction to Holding Packages. It is typically not advised to hold packages as it can disrupt package dependencies.
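
As a minimal sketch of holding a package, the following shell functions wrap the apt-mark subcommands hold, unhold, and showhold. The function names, the RUN variable, and the package name example-package are assumptions made for illustration. Setting RUN=echo prints a command instead of executing it.

```shell
#!/bin/sh
# Sketch: wrappers around apt-mark for holding packages.
# RUN is a hypothetical helper variable: it defaults to sudo, and
# RUN=echo prints the command instead of executing it (dry run).

# Keep the currently installed version of a package during upgrades.
hold_package() { ${RUN:-sudo} apt-mark hold "$1"; }

# Allow the package to be upgraded again.
release_package() { ${RUN:-sudo} apt-mark unhold "$1"; }

# Show every package that is currently held.
list_held() { apt-mark showhold; }

# Dry-run example with a placeholder package name:
RUN=echo hold_package example-package
```

Release the hold as soon as it is no longer needed, so that the package does not fall behind its dependencies.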

Important

When you upgrade DGX OS, the system remains on the installed GPU driver branch. For example, the GPU driver branch on the system does not automatically switch from R450 to R470. Refer to Changing Your GPU Branch for instructions on switching GPU driver branches.

For driver releases that have reached end of support, systems are transitioned to the next supported driver release.

DGX OS 5 Release Upgrade Advisements

Here is some additional information to consider when you intend to perform a release upgrade from DGX OS 4:

  • NGC Containers

    With DGX OS 5, customers should update their NGC containers to container release 20.10.17 or later if they are using multi-node training. For all other use cases, refer to the NGC Framework Containers Support Matrix. Refer to the NVIDIA Deep Learning Frameworks documentation for information about the latest container releases and how to access the releases.

  • Linux Kernel-based Virtual Machine Support

    DGX OS 5 does not support the Linux Kernel-based Virtual Machine (KVM) on DGX systems. NVIDIA KVM is available only with DGX-2 systems on DGX OS 4. DGX-2 customers that require this feature should stay with the latest DGX OS 4 release.

Getting Release Information for DGX Systems

Here is some information about how you can determine the release information for your DGX systems.

The /etc/dgx-release file provides release information, such as the product name and serial number. This file also tracks the history of the DGX OS software updates by providing the following information:

  • The version number and installation date of the last version to be installed from an ISO image (DGX_SWBUILD_VERSION).

  • The version number and update date of each over-the-network update applied since the software was last installed from an ISO image (DGX_OTA_VERSION).

For DGX OS 5, the DGX_OTA_VERSION entry indicates the latest ISO version that was released, and upgrades to the system include the changes that were made in the network repository up to the indicated date. You can use this information to determine whether your DGX system is running the current version of the DGX OS software.

To get release information for the DGX system, view the content of the /etc/dgx-release file. For example:

more /etc/dgx-release

DGX_NAME="DGX Station"
DGX_PRETTY_NAME="NVIDIA DGX Station"
DGX_SWBUILD_DATE="2017-09-18"
DGX_SWBUILD_VERSION="3.1.2"
DGX_COMMIT_ID="15cd1f473bb53d9b64503e06c5fee8d2e3738ece"
DGX_SERIAL_NUMBER=XXXXXXXXXXXXX

DGX_OTA_VERSION="3.1.3"
DGX_OTA_DATE="Wed Nov 15 15:35:25 PST 2017"

DGX_OTA_VERSION="4.7.0"
DGX_OTA_DATE="Fri Dec 19 13:49:06 PST 2020"

DGX_OTA_VERSION="5.0.0"
DGX_OTA_DATE="Tue Jan 19 14:23:18 PDT 2021"

DGX_OTA_VERSION="5.0.0"
DGX_OTA_DATE="Tue Feb 23 17:45:30 PST 2021"
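
Because the file accumulates one DGX_OTA_VERSION entry per update, the last entry is the one that matters. The following sketch extracts it. The function names are hypothetical, and the code assumes that the values are double-quoted as in the example above.

```shell
#!/bin/sh
# Sketch: summarize the release history recorded in a dgx-release file.
# The function names are hypothetical; the field names come from the file.
# An optional argument overrides the default path.

# Print the version that was installed from the ISO image.
iso_version() {
    grep '^DGX_SWBUILD_VERSION=' "${1:-/etc/dgx-release}" | tail -n 1 | cut -d '"' -f 2
}

# Print the version of the most recent over-the-network update, if any.
latest_ota_version() {
    grep '^DGX_OTA_VERSION=' "${1:-/etc/dgx-release}" | tail -n 1 | cut -d '"' -f 2
}

# Report both values when the file is present on this system.
if [ -r /etc/dgx-release ]; then
    echo "Installed from ISO: $(iso_version)"
    echo "Latest OTA update:  $(latest_ota_version)"
fi
```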

Preparing to Upgrade the Software

This section provides information about the tasks you need to complete before you can upgrade your DGX OS software.

Connect to the DGX System Console

Connect to the console of the DGX system using a direct connection or a remote connection through the BMC. See Connecting to the DGX System.

Note

SSH can be used to perform the upgrade. However, if the Ethernet port is configured for DHCP, the IP address might change after the DGX server is rebooted during the upgrade, which results in the loss of connection. A loss of connection might also occur if you are connecting through a VPN. If this happens, connect by using a direct connection or through the BMC to continue the upgrade process.

Warning

Connect directly to the DGX server console if the DGX is connected to a 172.17.xx.xx subnet.

DGX OS software installs Docker CE, which uses the 172.17.xx.xx subnet by default for Docker containers. If the DGX server is on the same subnet, you will not be able to establish a network connection to the DGX server.

To ensure that your DGX system can access the network interfaces for Docker containers, Docker should be configured to use a subnet distinct from other network resources used by the DGX system. See Configuring Docker IP Addresses for instructions on how to change the default Docker network settings after performing the upgrade.
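
To find out in advance whether the subnet warning applies, you can test the addresses of the system against the Docker default range. This is only a sketch: the function name is hypothetical, and the loop assumes the Linux hostname -I option.

```shell
#!/bin/sh
# Sketch: detect addresses in Docker's default 172.17.xx.xx range.
# in_docker_default_subnet is a hypothetical helper.
in_docker_default_subnet() {
    case "$1" in
        172.17.*) return 0 ;;
        *)        return 1 ;;
    esac
}

# hostname -I (Linux) prints the addresses assigned to this host.
for addr in $(hostname -I 2>/dev/null); do
    if in_docker_default_subnet "$addr"; then
        echo "Warning: $addr is in the 172.17.xx.xx range; use the console or the BMC."
    fi
done
```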

If you are using a GUI to connect to the console, see Performing Package Upgrades Using the GUI.

Verifying the DGX System Connection to the Repositories

Before you attempt to complete the update, you can verify that the network connection for your DGX system can access the public repositories and that the connection is not blocked by a firewall or proxy.

On the DGX system, enter the following:

wget -O f1-changelogs http://changelogs.ubuntu.com/meta-release-lts
wget -O f2-archive http://archive.ubuntu.com/ubuntu/dists/focal/Release
wget -O f3-security http://security.ubuntu.com/ubuntu/dists/focal/Release
wget -O f4-nvidia-baseos http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/dists/focal/Release
wget -O f5-nvidia-cuda https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/Release

The wget commands should be successful, and there should be five files in the directory with non-zero content.
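
The check for five non-empty files can be scripted. The following is a minimal sketch in which check_downloads is a hypothetical helper and the file names are the ones passed to wget -O above.

```shell
#!/bin/sh
# Sketch: confirm that each downloaded file exists and is not empty.
# check_downloads is a hypothetical helper. It returns a non-zero status
# when at least one file is missing or empty.
check_downloads() {
    status=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "OK      $f"
        else
            echo "MISSING $f"
            status=1
        fi
    done
    return "$status"
}

# Usage, from the directory where the wget commands were run:
#   check_downloads f1-changelogs f2-archive f3-security f4-nvidia-baseos f5-nvidia-cuda
```

A file reported as MISSING usually points to the repository that the firewall or proxy is blocking.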

Rotating the GPG Keys

NVIDIA constantly evaluates and improves security implementations. As part of these improvements, we are rolling out changes to harden the security and reliability of our repositories. These changes require rotating the GPG keys that are used to sign the metadata and packages in those repositories.

Important

This step is required only if you are running DGX OS 5.2 or earlier. If you already have DGX OS 5.3 installed, you can skip this section. Refer to Getting Release Information for DGX Systems to identify the version your system is running.
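
As a rough sketch of that version rule, the following function tests whether a DGX OS 5 version string is older than 5.3. The function name is hypothetical. It looks only at the major and minor fields and does not cover the separate rules for DGX OS 4.

```shell
#!/bin/sh
# Sketch: decide whether a DGX OS 5 version string predates 5.3.
# needs_key_rotation is a hypothetical helper. The body runs in a
# subshell so that its variables do not leak into the caller.
needs_key_rotation() (
    major=${1%%.*}      # text before the first dot
    rest=${1#*.}        # text after the first dot
    minor=${rest%%.*}   # second field
    [ "$major" -eq 5 ] && [ "$minor" -lt 3 ]
)

# Example with a version string as reported in /etc/dgx-release:
if needs_key_rotation "5.0.0"; then
    echo "Rotate the GPG keys before upgrading."
fi
```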

Rotating the GPG Key For a Default Installation or After Reimaging

This section provides information about how to rotate the GPG keys for a default DGX OS installation from the factory or after you reimage with the DGX OS ISO version 5.2.x or earlier.

  1. Download the new repository setup packages.

    wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/common/n/nvidia-repo-keys/nvidia-repo-keys_22.04-1_all.deb
    
    wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/dgx/n/nvidia-repos/dgx-repo_21.07-1_amd64.deb
    
    wget https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/pool/common/n/nvidia-repos/cuda-compute-repo_21.07-1_amd64.deb
    
  2. Directly install the .deb packages, which skips the GPG check performed in apt.

    Note

    If prompted, ensure that you accept the maintainer’s version for all files.

    sudo dpkg --force-confnew -i ./nvidia-repo-keys_22.04-1_all.deb ./dgx-repo_21.07-1_amd64.deb ./cuda-compute-repo_21.07-1_amd64.deb
    
  3. Manually revoke the previous DGX and CUDA GPG keys.

    sudo apt-key del 629C85F2
    
    sudo apt-key del 7FA2AF80
    

    OTA updates can now occur as normal.

Rotating the GPG Keys for the DGX Software Stack

This section provides information about how to rotate the GPG keys if you installed Ubuntu and the DGX Software Stack.

  1. Download the updated dgx-repo-files tarball and extract its contents onto the root filesystem.

    curl https://repo.download.nvidia.com/baseos/ubuntu/focal/dgx-repo-files.tgz | sudo tar xzf - -C /
    
  2. Manually revoke the previous DGX and CUDA GPG keys.

    sudo apt-key del 629C85F2
    
    sudo apt-key del 7FA2AF80
    

    OTA updates can now occur as normal.

Performing a Release Upgrade from DGX OS 4

This section describes how to perform a release upgrade from DGX OS 4 to DGX OS 5.

Important

If installed software packages do not have upgrade candidates and you try to upgrade, an error message is displayed. To continue the upgrade process, you need to use the --force option. Refer to the Release Notes for a list of packages that are no longer available in DGX OS 5.

Upgrade DGX OS 4 to the Latest Version

Before you can perform the release upgrade of your system, you need to upgrade the current DGX OS 4 to the latest version. These steps upgrade your system to the latest DGX OS 4 release:

  1. If you have DGX OS 4.12 or earlier installed or version 4.99.x on DGX A100, ensure you have the correct GPG signing keys installed on your system. Before you continue the upgrade, refer to Rotating the GPG Keys for instructions and details.

  2. Download information from all configured sources about the latest versions of the packages.

    sudo apt update
    
  3. Install all available upgrades for your current DGX OS release.

    sudo apt -y full-upgrade
    

Note

Depending on which packages were updated when running sudo apt -y full-upgrade, you might be prompted to reboot the system before running nvidia-release-upgrade.
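
If you are unsure whether a reboot is pending, one way to check is sketched below. It assumes the Ubuntu convention that packages create the marker file /var/run/reboot-required when a restart is needed. The function name and its optional path argument are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether a package has requested a reboot.
# Assumption: the marker file /var/run/reboot-required, which Ubuntu
# packages create when a restart is needed. The optional argument is a
# hypothetical override used for testing.
reboot_pending() {
    [ -f "${1:-/var/run/reboot-required}" ]
}

if reboot_pending; then
    echo "Reboot before running nvidia-release-upgrade."
else
    echo "No reboot is pending."
fi
```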

Performing the Release Upgrade

Follow these steps to upgrade your system from DGX OS 4 to DGX OS 5:

  1. Install the nvidia-release-upgrade package, which performs the upgrade to DGX OS 5.

    sudo apt install -y nvidia-release-upgrade
    
  2. Issue the following only if upgrading from DGX OS 4.99.

    sudo apt install -y nvidia-fabricmanager-450/bionic-updates --allow-downgrades
    
  3. Start the DGX OS release upgrade process.

    sudo nvidia-release-upgrade
    

    If you are using a proxy server, add the -E option to keep your proxy environment variables. For example:

    sudo -E nvidia-release-upgrade
    

    Note

    This step upgrades all driver versions that are currently supported in DGX OS 4 (R410, R418, and R450) to version R470.

    Note

    Some package upgrades require that you reboot the system before completing the upgrade. Ensure that you reboot the system when prompted.

  4. Resolve conflicts.

    Refer to Resolving Release Upgrade Conflicts for details and instructions.

  5. Wait for the upgrade process to complete, and press return or n at the prompt that appears when the system upgrade is completed.

    System upgrade is complete. Restart required To finish the upgrade, a
    restart is required. If you select 'y' the system will be restarted.
    Continue [yN]
    

    Check if nvidia-fabricmanager-<version>, libnvidia-nscq-<version>, or both packages need to be updated by running the following command.

    dpkg -l | grep -E -i  'nscq|fabricmanager'
    
    ii  libnvidia-nscq-450             450.248.02-0ubuntu0.20.04.1       amd64        NVSwitch Configuration and Query library
    ii  nvidia-fabricmanager-450       450.248.02-0ubuntu0.20.04.1       amd64        Fabric Manager for NVSwitch based systems.
    

    If nvidia-fabricmanager-<version>, libnvidia-nscq-<version>, or both packages are installed, compare the installed versions with the NVIDIA Server Driver. For example:

    dpkg -l | grep 'NVIDIA Server Driver metapackage'
    
    ii nvidia-driver-535-server 535.161.07-0ubuntu0.20.04.1 amd64 NVIDIA Server Driver metapackage
    

    If the NVIDIA Server Driver version is newer than either of the two packages, the packages must be updated to the same major version as the NVIDIA Server Driver. For example:

    sudo apt install nvidia-fabricmanager-535 libnvidia-nscq-535
    

    Restart the system to complete the update process and to ensure that any changes are captured by restarted services and runtimes.

    sudo reboot
    

After the system is restarted, the upgrade process takes several minutes to perform some final installation steps.
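
The branch comparison in step 5 relies on the number embedded in each package name. The following sketch extracts and compares those numbers. Both function names are hypothetical, and the package names are the ones from the example output above.

```shell
#!/bin/sh
# Sketch: compare driver branches embedded in package names.
# Both function names are hypothetical.

# Print the first number in a package name, for example 535 for
# nvidia-driver-535-server.
pkg_branch() {
    printf '%s\n' "$1" | grep -oE '[0-9]+' | head -n 1
}

# Succeed when the two packages belong to the same driver branch.
same_branch() {
    [ "$(pkg_branch "$1")" = "$(pkg_branch "$2")" ]
}

# Example with the package names shown in step 5:
if ! same_branch nvidia-driver-535-server nvidia-fabricmanager-450; then
    echo "Install nvidia-fabricmanager-$(pkg_branch nvidia-driver-535-server)"
fi
```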

Resolving Release Upgrade Conflicts

During the upgrade, the system might encounter conflicts or require other manual intervention.

  • When you are prompted to resolve conflicts in configuration files, evaluate the changes before selecting one of the following options:

    • Accepting the maintainer’s version.

    • Keeping the local version.

    • Manually resolving the difference.

    Conflicts in some configuration files might be the result of customizations to the Ubuntu Desktop OS made for DGX OS software. For guidance about how to resolve these conflicts, see the chapter in the Release Notes for the release family to which you are upgrading.

    • /etc/apt/sources.list.d/dgx.list. You should install the package maintainer’s version.

    • /etc/ssh/sshd_config. You can keep the local version that is currently installed.

    Conflicts in the following configuration files are the result of customizations to the Ubuntu Desktop OS made for DGX OS 5.

    • /etc/gdm3/custom.conf.distrib. You can keep your currently installed version.

    • /etc/gdm3/custom.conf. You can keep your currently installed version.

  • If you have packages that do not have upgrade candidates, you will see the following message:

    WARNING: The following packages are installed, but have no 20.04 upgrade path. They will be uninstalled during the release upgrade process. libnccl2 libnccl-dev libcudnn7 libcudnn7-dev libcudnn7-doc libcudnn8 libcudnn8-dev libcudnn8-samples The --force option must be used to proceed.

    If you see this message, run the nvidia-release-upgrade command with the --force option.

  • If you are logged in to the DGX system remotely through secure shell (SSH), you are prompted about whether you want to continue running under SSH.

    Continue running under SSH?
    This session appears to be running under ssh. It is not recommended to perform a upgrade over ssh currently because in case of failure it is harder to recover.
    If you continue, an additional ssh daemon will be started at port '1022'.
    Do you want to continue?
    Continue [yN]
    
    • Enter y to continue.

    • An additional sshd daemon is started and the following message is displayed:

      Starting additional sshd To make recovery in case of failure easier, an
      additional sshd will be started on port '1022'. If anything goes wrong
      with the running ssh you can still connect to the additional one. If you
      run a firewall, you may need to temporarily open this port. As this is
      potentially dangerous it's not done automatically. You can open the port
      with e.g.: 'iptables -I INPUT -p tcp --dport 1022 -j ACCEPT' To continue
      please press [ENTER]
      
    • Press Enter.

  • If you are warned that third-party sources are disabled:

    Third party sources disabled
    Some third party entries in your sources.list were disabled. You can re-enable them after the upgrade with the 'software-properties' tool or your package manager.
    To continue please press [ENTER]
    

    Canonical and DGX repositories are preserved for the upgrade, but any other repositories, for example, Google Chrome or VSCode, will be disabled. After the upgrade, you must manually re-enable any third-party sources that you want to keep.

    • Press Enter.

  • You are asked to confirm that you want to start the upgrade.

    Do you want to start the upgrade?
    Installing the upgrade can take several hours. Once the download has finished, the process cannot be canceled.
    Continue [yN] Details [d]
    
    • Press Enter.

  • (DGX Station only) In response to the warning that lock screen is disabled, press Enter to continue. Do not press Ctrl+C to respond to this warning, because pressing Ctrl+C terminates the upgrade process.

  • If you are prompted to confirm that you want to remove obsolete packages, select one of the options:

    Remove obsolete packages?
    371 packages are going to be removed. Removing the packages can take several hours.
    Continue [yN]   Details [d]
    
    • Review the list of packages that will be removed. To identify obsolete DGX OS Desktop packages, see the lists of obsolete packages in the DGX OS Desktop Release Notes (https://docs.nvidia.com/dgx/dgx-os-desktop-release-notes/index.html) for all releases after your current release.

    • If the list contains only packages that you want to remove, enter y to accept the recommended changes and continue with the upgrade. Otherwise, enter n (default) for no, or d for more details.

Verifying the Upgrade

Here are the steps to verify your upgrade.

  1. Confirm the Linux kernel version.

    For example, when you upgrade to DGX OS 5.0, the Linux kernel version is at least 5.4.0-52-generic. For the minimum Linux kernel version of the release to which you are upgrading, refer to the release notes for that release.

  2. Confirm the NVIDIA Graphics Drivers for Linux version.

    nvidia-smi
    

    For example, for an upgrade to DGX OS Desktop 5.0, the NVIDIA Graphics Drivers for Linux version is at least 450.80.02:

    Tue Oct 13 09:02:14 2022
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
    |-------------------------------+----------------------+----------------------+
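
The kernel check in the first verification step can be scripted with uname -r. The following is a minimal sketch in which version_ge is a hypothetical helper and 5.4.0-52 is the DGX OS 5.0 minimum quoted above. Substitute the minimum from the release notes of your target release.

```shell
#!/bin/sh
# Sketch: check that a kernel release meets a minimum version.
# version_ge is a hypothetical helper that compares the numeric fields
# separated by "." or "-"; non-numeric fields such as "generic" count as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, /[.-]/)
        m = split(b, y, /[.-]/)
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            p = (i <= n) ? x[i] + 0 : 0
            q = (i <= m) ? y[i] + 0 : 0
            if (p > q) exit 0
            if (p < q) exit 1
        }
        exit 0
    }'
}

# 5.4.0-52 is the DGX OS 5.0 minimum quoted in the verification steps.
if version_ge "$(uname -r)" "5.4.0-52"; then
    echo "Kernel $(uname -r) meets the minimum."
else
    echo "Kernel $(uname -r) is older than 5.4.0-52."
fi
```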
    

Recovering from an Interrupted or Failed Update

If the script is interrupted during the update because of a loss of power or a loss of network connection, restore power or restore the network connection, depending on the issue.

If the system encounters a kernel panic after you restore power and reboot the DGX system, you cannot perform the over-the-network update. You need to reinstall DGX OS 5 with the latest image instead. See Reimaging for instructions, and then complete the network update.

If you can successfully return to the Linux command line, complete the following steps.

  1. Reconfigure the packages.

    sudo dpkg --configure -a
    
  2. Fix the broken package installs.

    sudo apt -f install -y
    
  3. Determine where the release-upgrader was extracted.

    /tmp/ubuntu-release-upgrader-<random-string>
    
  4. Start a bash shell, go to the upgrader, and configure.

    sudo bash
    
    cd /tmp/ubuntu-release-upgrader-<random-string>
    
    RELEASE_UPGRADER_ALLOW_THIRD_PARTY=1 ./focal --frontend=DistUpgradeViewText
    

    Do not reboot at this time.

  5. Issue the following command and reboot.

    bash /usr/bin/nvidia-post-release-upgrade
    
    reboot
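
Step 3 gives only the pattern of the directory name. As a sketch, the following function lists the matching directories so that you can choose the one to use in step 4. The function name is hypothetical, and the optional argument overrides the /tmp search location.

```shell
#!/bin/sh
# Sketch: list the directories where the release upgrader was extracted.
# find_upgrader_dirs is a hypothetical helper; the optional argument
# overrides the /tmp search location.
find_upgrader_dirs() {
    for d in "${1:-/tmp}"/ubuntu-release-upgrader-*; do
        # An unmatched glob stays literal, so keep only real directories.
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
        fi
    done
}

find_upgrader_dirs
```

If more than one directory is listed, the most recently modified one is likely the interrupted upgrade, but confirm this before you continue.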
    

Performing Package Upgrades

NVIDIA and Canonical provide updates to the OS in the form of updated software packages between releases with security mitigations and bug fixes. You should evaluate the available updates at regular intervals and update the system based on the threat level.

Enabling Extended Security Maintenance Upgrades

This section provides information about Ubuntu’s Extended Security Maintenance (ESM).

As a DGX OS customer, you are entitled to Extended Security Updates from the Ubuntu Universe repository.

You may see the following Ubuntu Pro message from ubuntu-advantage-tools during an apt upgrade if security updates are available for packages from the Ubuntu Universe repository:

Get more security updates through Ubuntu Pro with 'esm-apps' enabled.
Learn more about Ubuntu Pro at https://ubuntu.com/pro.

In addition, DGX users will also get the following NVIDIA message:

Your DGX contract entitles you to Extended Security Maintenance updates
for additional packages in the Ubuntu repository. Please
contact NVIDIA Support to get your key to enable this capability.

After contacting NVIDIA Enterprise Support to obtain an Ubuntu Pro token, you can use the token with the following command to enable Extended Security Maintenance updates:

sudo pro attach XXXXX

You can check the Ubuntu Pro subscription with the sudo pro status command:

sudo pro status

Performing Package Upgrades Using the CLI

You should evaluate the available updates at regular intervals and update the system based on the threat level:

  • Refer to the Ubuntu Wiki Upgrades for more information about upgrades available for Ubuntu.

  • For a list of the known Common Vulnerabilities and Exposures (CVEs), including those that can be resolved by updating the DGX OS software, refer to the Ubuntu Security Notices.

If updates are available, you can obtain upgraded packages by completing the following steps:

  1. If you have a DGX OS version earlier than 5.3, ensure you have the correct GPG signing keys installed on your system. Before you continue the upgrade, refer to Rotating the GPG Keys for instructions and details.

  2. Update the internal database with the list of available packages and their versions.

    sudo apt update
    
  3. Review the packages that will be upgraded.

    sudo apt full-upgrade -s
    

    To prevent an application from being upgraded, you can instruct the Ubuntu package manager to “hold packages”. Refer to Holding Packages for more information.

    Note

    Holding packages should be used only in extremely rare cases because it can disrupt package dependencies.

  4. Install the updated CUDA repo preferences package to ensure the Fabric Manager and NSCQ library are installed from the Canonical repo.

    sudo apt install cuda-compute-repo
    
  5. Upgrade to the latest version.

    sudo apt full-upgrade
    

    When prompted to resolve an issue, answer any questions that appear. Most questions require a Yes or No response.

    • When prompted to select which GRUB configuration to use, select the current one on the system.

    • When prompted to select the GRUB install devices, keep the default selection.

    • The other questions will depend on what other packages were installed before the update, and how those packages interact with the update.

    • If a message appears that indicates that the nvidia-docker.service failed to start, you can disregard it and continue with the next step. The service will start at that time.

  6. When the upgrade is complete, reboot the system.

    sudo reboot
    

Note

Upgrades to the NVIDIA Graphics Drivers for Linux require a restart to complete the kernel upgrade. If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX system, when you run the nvidia-smi command, an error message is displayed.

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
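
To review a long simulation from step 3 more easily, you can reduce it to a list of package names. This sketch assumes that the simulation output marks each package to be installed or upgraded with a line that starts with Inst. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: list the packages that a simulated upgrade would install.
# Assumption: simulation output marks each package to be installed or
# upgraded with a line that starts with "Inst".
# planned_packages is a hypothetical helper that reads that output on stdin.
planned_packages() {
    awk '$1 == "Inst" { print $2 }'
}

# Usage on a DGX system. apt-get dist-upgrade is the apt-get equivalent
# of apt full-upgrade and is better suited to pipelines:
#   sudo apt-get -s dist-upgrade | planned_packages | grep -i nvidia
```

Filtering for nvidia in this way shows whether a pending upgrade touches the driver packages, which is the case that requires the restart described in the note above.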

Managing Software Upgrades on DGX Station

This section provides information about managing upgrades between DGX OS releases by using a GUI tool on DGX Station.

Performing Package Upgrades Using the GUI

You can use the graphical Software Updater application to manage package upgrades on the DGX Station.

Ensure that you are logged in to your Ubuntu desktop on the DGX Station as an administrator user.

  1. Press the Super key.

    This key is usually found next to the Alt key. Refer to What is the Super key? for more information.

    • If you are using a Windows keyboard, the Super key usually has a Windows logo on it, and it is sometimes called the Windows key or system key.

    • If you are using an Apple keyboard, this key is known as the Apple key.

  2. In the search bar, type Software Updater.

  3. Open the Software Updater, review the available updates, and click [Install Now].

    Screen capture showing the software updater window.


    • If no updates are available, the Software Updater informs you that your software is up to date.

    • If an update requires the removal of obsolete packages, you will be warned that not all updates can be installed.

    To continue with the update, complete the following steps:

    1. Click [Partial Upgrade].

    2. Review the list of packages that will be removed. To identify obsolete DGX Station packages, see the lists of obsolete packages in the DGX OS Desktop Release Notes for all releases after your current release.

    3. If the list contains only packages that you want to remove, click [Start Upgrade].

  4. When prompted to authenticate, type your password into the [Password] field and click [Authenticate].

  5. When the update is complete, restart DGX Station.

Restart the system even if you are not prompted to restart it to complete the updates. Any update to the NVIDIA Graphics Drivers for Linux requires a restart. If you update the NVIDIA Graphics Drivers for Linux without restarting the DGX Station, running the nvidia-smi command displays an error message.

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Checking for Updates to DGX Station Software

In Software & Updates, you can change your settings to automatically check for package updates and to configure updates from the Ubuntu software repositories. You can also configure your DGX Station to notify you of important security updates more frequently than other updates.

In the following example, the DGX Station is configured to check for updates daily, to display important security updates immediately, and to display other updates every two weeks.

Screen capture showing the options in the Updates tab of Ubuntu Software & Updates window to check for updates daily, to display important security updates immediately, and to display other updates every two weeks.
