DGX Station User Guide

Documentation for users and administrators that explains how to install, set up, and maintain the DGX Station.

About this Guide

DGX Station User Guide explains how to install, set up, and maintain the NVIDIA® DGX Station™.

This guide is aimed at users and administrators who are familiar with the Ubuntu Desktop Linux OS, including use of the command line and the sudo command.

Note: The instructions in this guide for software administration apply only to the DGX OS Desktop. They don't apply if the DGX OS Desktop software that is supplied with the DGX Station has been replaced with the DGX software for Red Hat Enterprise Linux or CentOS.

For additional information to help you use the DGX Station, see the following table.

Task | Additional Information
Use the Ubuntu Desktop Linux OS |
Find out about the DGX OS Desktop software for the DGX Station | DGX OS Desktop Release Notes
Use the DGX Station to download and run containers for deep learning frameworks | NGC Container Registry for DGX User Guide
Use deep learning frameworks optimized for NVIDIA DGX systems | NVIDIA Deep Learning Frameworks Documentation
Use the tools and libraries in the DGX OS Desktop for development of deep learning frameworks | NVIDIA Deep Learning SDK Documentation

1. Introduction to the NVIDIA® DGX Station™

The NVIDIA DGX Station is a fast, multi-GPU workstation for deep learning and AI analytics. You can use the DGX Station to run neural networks and deploy deep learning models. Because the DGX Station is software compatible with the NVIDIA DGX-1 server, you can also use the DGX Station to optimize applications to run on a production DGX-1 cluster.



Photograph showing the front and the side of the DGX Station.

1.1. What's in the Box

  • DGX Station
  • Accessory boxes containing:
    • Quick Start Guide
    • AC power cable
    • 3 DisplayPort™ 1.2 to HDMI 2.0 adapters
    • USB recovery flash drive containing a backup copy of the operating system image and CUDA toolkit
    • DVD-ROM containing source code of open-source software installed on the DGX Station
    • Toxic Substance Notice and Safety Instructions
    • Declaration of Conformity
    • Repacking Instructions/Intra-Transit

Inspect each piece of equipment in the packing box. If anything is missing or damaged, contact your supplier.

1.2. DGX OS Desktop Software Summary

The DGX OS Desktop software that is supplied with the DGX Station includes the software that you need for downloading and running containers for deep learning frameworks. The software is already installed on the DGX Station, except where licensing requirements mandate that the software be supplied separately. Any software that must be supplied separately is installed automatically when the DGX Station is first powered on.

For details about the DGX OS Desktop software, refer to DGX OS Desktop Release Notes.

Note:

You can replace the DGX OS Desktop software that is supplied with the DGX Station by installing the DGX software for Red Hat Enterprise Linux or CentOS. For instructions, see:

1.3. DGX Station Hardware Summary

Processors

Component | Qty | Description
CPU | 1 | Intel Xeon E5-2698 v4 2.2 GHz (20-Core)
GPU - current units | 4 | NVIDIA Tesla® V100-DGXS-32GB with 32 GB per GPU (128 GB total) of GPU memory
GPU - earlier units | 4 | NVIDIA Tesla V100-DGXS-16GB with 16 GB per GPU (64 GB total) of GPU memory

System Memory and Storage

Component | Qty | Unit Capacity | Total Capacity | Description
System memory | 8 | 32 GB | 256 GB | ECC Registered RDIMM DDR4 SDRAM
Note: You can replace all eight factory-installed 32-GB DIMMs with 64-GB DIMMs to give a total capacity of 512 GB.
Data storage | 3 | 1.92 TB | 5.76 TB | 2.5" 6 Gb/s SATA III SSD in RAID 0 configuration
Note: Starting with DGX OS Desktop 4.4.0 or DGX software for Red Hat Enterprise Linux or CentOS EL7-20.02, you can add four 1.92-TB SSDs for data storage to give a total capacity of 13.44 TB in a RAID 0 configuration.
OS storage | 1 | 1.92 TB | 1.92 TB | 2.5" 6 Gb/s SATA III SSD

2. Setting Up the NVIDIA DGX Station

Before using the DGX Station, ensure that its initial set-up is complete.

2.1. Siting the DGX Station

CAUTION:

The DGX Station weighs 88 lbs (40 kg). Do not attempt to lift the DGX Station. Instead, remove the DGX Station from its packaging and move it into position by rolling it on its fitted casters.

To prevent damage to components inside the DGX Station, do not subject the DGX Station to excessive vibration or mechanical shock. After moving or transporting the DGX Station, visually inspect the NVLINK bridge, which connects the GPUs, and the drive trays in the drive cage to see if they have shifted out of position. If any of these components has shifted, reseat the component before operating the DGX Station.

Site the DGX Station in a location that is clean, dust-free, well ventilated, and near an appropriately rated, grounded AC power outlet.

Leave approximately 5" (12.5 cm) of clearance behind and at the sides of the DGX Station to allow sufficient airflow for cooling the unit.

When operating the DGX Station, keep the ambient temperature and relative humidity within the following ranges:

  • Ambient temperature: 10°C to 30°C (50°F to 86°F)

  • Relative humidity: 10% to 80% (non-condensing)

Always keep the DGX Station upright. Do not lay the unit on its side.



Line drawing showing the DGX Station upright with a check mark to indicate that this position is correct and laid flat with a cross mark to indicate that this position is incorrect.

2.2. Removing or Replacing the Packing Inside the DGX Station

To prevent damage to components inside the DGX Station during transit, a foam packing piece is packed inside the DGX Station. Before you connect and power on the DGX Station, you must remove this packing piece from inside the DGX Station. If you are returning the DGX Station to NVIDIA under a return merchandise authorization (RMA), replace this packing piece before repacking the DGX Station.
Before you begin, ensure that:
  • The DGX Station is shut down and powered off.
  • The power cable, all communications cables, and any peripheral devices such as displays and keyboards are disconnected from the DGX Station.
  1. Push the button on the right side of the DGX Station back panel to release the side panel on the right of the DGX Station when viewed from the rear.

    Line drawing showing the button on the right side of the DGX Station back panel being pushed.

  2. Lift the panel to remove it.

    Line drawing showing the DGX Station side-panel being removed.

    CAUTION:
    To prevent damage from electrostatic discharge, avoid touching any of the components inside the DGX Station.
  3. Remove or replace the foam packing piece that surrounds the GPU cards inside the DGX Station.
    • To remove the foam packing piece, gently grasp it and pull it towards you.

      If you are unpacking an advance-shipped replacement for a unit that you are returning to NVIDIA under an RMA, retain this foam packing piece with all other DGX Station packaging. You will need the packaging to repack your original DGX Station for shipment to NVIDIA.

    • To replace the foam packing piece, gently push it into position around the GPU cards inside the DGX Station.



    Line drawing showing the foam packing piece being removed from DGX Station.

  4. Align the bottom edge of the side panel with the bottom edge of the DGX Station.

    Line drawing showing the side-panel being aligned with the bottom edge of the DGX Station.

  5. Firmly push the panel back into place to re-engage the latches.

    Line drawing showing the DGX Station side-panel latches being re-engaged.

2.3. Connecting and Powering on the DGX Station

To complete this task you need the following items, which are not supplied with the DGX Station:

  • Display with power cable and connector cable terminated in a DisplayPort™ connector or HDMI connector

    If your display connector cable is terminated in an HDMI connector, you can use one of the supplied adapters to connect the cable to the DGX Station.

  • USB keyboard
  • USB mouse
  • Ethernet cable
  1. Connect a display to any DisplayPort connector and a keyboard and mouse to any two USB ports.

    Line drawing showing display and keyboard connections to the DGX Station.

    Note: For initial setup, connect only one display to the DGX Station. After you complete the initial Ubuntu OS configuration, you can configure the DGX Station to use multiple displays. For details, see Configuring the DGX Station To Use Multiple Displays.
  2. Use either of the two Ethernet ports to connect the DGX Station to your LAN with Internet connectivity.

    Line drawing showing LAN connections to the DGX Station.

    Note:

    Connect only one Ethernet port on the DGX Station to the Internet unless you plan to configure the ports manually and disable DHCP on at least one of the ports.

    By default, both Ethernet ports on the DGX Station are configured for DHCP. If both ports are connected simultaneously, each port gets its own IP address. The IP address that the Linux operating system (OS) uses then alternates between these addresses, causing the OS and applications to malfunction.

  3. Make sure that the power supply rocker switch is in the OFF position.

    Current units:



    Line drawing showing the operation of the DGX Station PSU rocker switch to the OFF position.

    Earlier units:



    Line drawing showing the operation of the DGX Station PSU rocker switch to the OFF position for earlier units.

  4. Connect the supplied power cable from the power socket at the back of the unit to an appropriately rated, grounded AC outlet. For details of the power consumption, input voltage, and current rating of the DGX Station, see Power Specifications.

    Current units:



    Line drawing showing the power cable connection to the DGX Station.

    Earlier units:



    Line drawing showing the power cable connection to the DGX Station for earlier units.

    CAUTION:

    Use only the supplied power cable and do not use this power cable with any other products or for any other purpose. Not all power cables have the same current ratings.

    Do not use household extension cables with your product. Household extension cables do not have overload protection and are not intended for use with computer systems.

  5. Connect the display to a suitable AC outlet and power on the display.
  6. Move the DGX Station power supply rocker switch to the ON position.

    Current units:



    Line drawing showing the operation of the DGX Station PSU rocker switch to the ON position.

    Earlier units:



    Line drawing showing the operation of the DGX Station PSU rocker switch to the ON position for earlier units.

  7. Push the Power button on the front of the unit to power on the DGX Station.

    Line drawing showing the operation of the DGX Station Power push button switch.

2.4. Completing the Initial Ubuntu OS Configuration

When you power on the DGX Station for the first time, you are prompted to accept end user license agreements for NVIDIA software. You are then guided through the process for completing the initial Ubuntu OS configuration.
Note: During the configuration process, to prevent unauthorized users from using non-default boot entries and modifying boot parameters, you need to enter a GRUB password.
  1. Accept the EULA and click Continue.
  2. Select your language, for example, English – English, and click Continue.
  3. Select your keyboard, for example, English (US), and click Continue.
  4. Select your location, for example, Los Angeles, and click Continue.
  5. Enter your username and password, enter the password again to confirm it, and click Continue. Here are some requirements to remember:
    • The username must be composed of lower-case letters.
    • The username will be used instead of the root account for administrative activities.
    • It is also used as the GRUB username.
    • Ensure you enter a strong password.

      If the password that you entered is weak, a warning appears.

  6. Enter the GRUB password and click OK.
    • Your GRUB password must have at least 8 characters.

      If it has fewer than 8 characters, you cannot click Continue.

    • If you do not enter a password, GRUB password protection will be disabled.
  7. If you performed the automated encryption install, you will also be prompted to create a new passphrase for your root filesystem.
    • The default passphrase was seeded with nvidia3d. This default passphrase is disabled when you complete this step.
    • This new passphrase will be used to unlock your root filesystem when the system boots.

After the Ubuntu OS configuration is complete, you can log in to the DGX Station to access your Ubuntu desktop.

Note: Updates to the DGX Station software might have been made available after your DGX Station was manufactured. To ensure that you have the latest DGX Station software, including security updates, check for updates and install any available updates before using your DGX Station. For more information, see Upgrading Within the Same DGX OS Desktop Major Release.

2.6. Registering Your DGX Station

To obtain support for your DGX Station, follow the instructions for registration in the Entitlement Certification email that was sent as part of the purchase.

Registration allows you to access the NVIDIA Enterprise Support Portal, obtain technical support, get software updates, and set up an NGC for DGX systems account. If you did not receive the information, open a case with the NVIDIA Enterprise Support Team at https://www.nvidia.com/en-us/support/enterprise/.

2.7. Configuring the DGX Station To Use Multiple Displays

One of the NVIDIA Tesla V100 GPU cards in the DGX Station provides three DisplayPort connectors, enabling you to connect up to three displays to the DGX Station. If you want to use more than one display with the DGX Station, configure it to use multiple displays after you complete the initial Ubuntu OS configuration.

  1. Connect the displays that you want to use to the DisplayPort connectors at the rear of the DGX Station.

    Each display is automatically detected as you connect it.



    Screen capture showing the DGX OS Desktop when two displays are connected to the DGX Station.

  2. Optional: If necessary, adjust the display configuration, such as switching the primary display, or changing monitor positions or orientation.
    1. Open the Displays window.
      • DGX OS Desktop 4 releases: Open the Ubuntu system menu at the right of the desktop menu bar, click the tools icon, and in the Settings window that opens, choose Devices > Displays.
      • DGX OS Desktop 3 releases: From the Ubuntu system menu at the right of the desktop menu bar, choose System Settings and in the System Settings window that opens, click Displays.
    2. In the Displays window that opens, make the changes to the display settings that you want and click Apply.

      Screen capture showing the Ubuntu Displays window.

High-resolution displays consume a large quantity of GPU memory. If you have connected three 4K displays to the DGX Station, they may consume most of the GPU memory on the NVIDIA Tesla V100 GPU card to which they are connected, especially if you are running graphics-intensive applications.

If you are running memory-intensive compute workloads on the DGX Station and are experiencing performance issues, consider conserving GPU memory by reducing or minimizing the graphics workload.

  • To reduce the graphics workload, disconnect any additional displays you connected and use only one display with the DGX Station.

    If you disconnect a display from the DGX Station, the disconnection is automatically detected and the display settings are automatically adjusted for the remaining displays.

  • To minimize the graphics workload, shut down the display manager and use secure shell (SSH) to log in to the DGX Station remotely.

    • DGX OS Desktop 4 releases: To shut down the GNOME display manager, type the following command:

      $ sudo telinit 3
    • DGX OS Desktop 3 releases: To shut down the LightDM display manager, type the following command:

      $ sudo service lightdm stop

    To start the display manager, log in to the DGX Station remotely and type the command for your DGX OS Desktop release:

    • DGX OS Desktop 4 releases:
      $ sudo telinit 5
    • DGX OS Desktop 3 releases:
      $ sudo service lightdm start
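
Before you decide whether to disconnect displays or shut down the display manager, it can help to see how much memory each GPU is using. The following is a minimal sketch, not part of the DGX OS Desktop software: the helper name gpu_mem_over and the 50 percent threshold are assumptions chosen for illustration. It relies on the nvidia-smi --query-gpu and --format=csv,noheader,nounits options.

```shell
#!/bin/sh
# gpu_mem_over is a hypothetical helper, not an NVIDIA tool.
# It reads lines of "index, used_MiB, total_MiB" on stdin and prints the
# index and percentage used for each GPU at or above the given threshold.
gpu_mem_over() {
    awk -F', *' -v t="${1:-50}" '
        $3 > 0 {
            pct = int($2 * 100 / $3)
            if (pct >= t) print $1, pct "%"
        }'
}

# Query the real GPUs only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,memory.used,memory.total \
               --format=csv,noheader,nounits | gpu_mem_over 50
fi
```

If the GPU that drives the displays appears in the output while the others do not, the graphics workload is a likely contributor to the memory pressure.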

2.9. Preparing the DGX Station for Use with Docker

Some initial setup of the DGX Station is required to ensure that users have the required privileges to run Docker containers and to prevent IP address conflicts between Docker and the DGX Station.

Note: For more information about setting filesystem quotas, see How To Set Filesystem Quotas on Ubuntu 18.04. By default, DGX OS provides steps 1 and 2, so start with step 3.

2.9.1. Enabling Users To Run Docker Containers

To prevent the docker daemon from running without protection against escalation of privileges, the Docker software requires sudo privileges to run containers. Meeting this requirement involves enabling users who will run Docker containers to run commands with sudo privileges. Therefore, you should ensure that only users whom you trust and who are aware of the potential risks to the DGX Station of running commands with sudo privileges are able to run Docker containers.

Before allowing multiple users to run commands with sudo privileges, consult your IT department to determine whether you would be violating your organization's security policies. For the security implications of enabling users to run Docker containers, see Docker daemon attack surface.

You can enable users to run the Docker containers in one of the following ways:

  • Add each user as an administrator user with sudo privileges.

  • Add each user as a standard user without sudo privileges and then add the user to the docker group. This approach is inherently insecure because any user who can send commands to the docker engine can escalate privilege and run root-user operations.

    To add an existing user to the docker group, run this command:

    $ sudo usermod -aG docker user-login-id
    user-login-id
    The user login ID of the existing user that you are adding to the docker group.
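
To confirm that a user was added to the docker group, you can inspect the user's group list. This is a minimal sketch that uses the standard id command. The helper name in_group is an assumption made for illustration.

```shell
#!/bin/sh
# in_group is a hypothetical helper: succeeds if user $1 belongs to group $2.
in_group() {
    id -nG "$1" | tr ' ' '\n' | grep -qxF "$2"
}

# Report whether the current user is a member of the docker group.
if in_group "$(id -un)" docker; then
    echo "$(id -un) is in the docker group"
else
    echo "$(id -un) is not in the docker group"
fi
```

A change in group membership normally takes effect the next time the user logs in.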

2.9.2. Preventing IP Address Conflicts Between Docker and the DGX Station

To ensure that the DGX Station can access the network interfaces for Docker containers, configure the containers to use a subnet distinct from other network resources used by the DGX Station. By default, Docker uses the 172.17.0.0/16 subnet. If addresses within this range are already used on the DGX Station network, change the Docker network to specify the bridge IP address range and container IP address range to be used by Docker containers.

This task requires sudo privileges.
  1. Open the /etc/systemd/system/docker.service.d/docker-override.conf file in a plain-text editor, such as vi.
    $ sudo vi /etc/systemd/system/docker.service.d/docker-override.conf
  2. Append the following options to the line that begins ExecStart=/usr/bin/dockerd, which specifies the command to start the dockerd daemon:
    • --bip=bridge-ip-address-range
    • --fixed-cidr=container-ip-address-range
    bridge-ip-address-range
    The bridge IP address range to be used by Docker containers, for example, 192.168.127.1/24.
    container-ip-address-range
    The container IP address range to be used by Docker containers, for example, 192.168.127.128/25.

    This example shows a complete /etc/systemd/system/docker.service.d/docker-override.conf file that has been edited to specify the bridge IP address range and container IP address range to be used by Docker containers.

    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd:// -s overlay2 --default-shm-size=1G --bip=192.168.127.1/24 --fixed-cidr=192.168.127.128/25
    LimitMEMLOCK=infinity
    LimitSTACK=67108864
    
    Note: Starting with DGX OS Desktop release 3.1.4, the option --disable-legacy-registry=false is removed from the Docker CE service configuration file docker-override.conf. The option is removed for compatibility with Docker CE 17.12 and later.
  3. Save and close the /etc/systemd/system/docker.service.d/docker-override.conf file.
  4. Reload the Docker settings for the systemd daemon.
    $ sudo systemctl daemon-reload
  5. Restart the docker service.
    $ sudo systemctl restart docker
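
To decide whether this change is needed, you can test whether an address in use on your network falls inside the Docker default subnet, 172.17.0.0/16. The following sketch does the comparison with shell arithmetic only. The helper names ip_to_int and in_cidr and the sample host addresses are assumptions for illustration, not part of Docker or DGX OS Desktop.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if IPv4 address $1 lies inside the CIDR block $2.
in_cidr() {
    addr=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Both addresses below are assumed sample values.
for host_addr in 172.17.4.20 10.0.0.15; do
    if in_cidr "$host_addr" 172.17.0.0/16; then
        echo "$host_addr conflicts with the Docker default subnet"
    else
        echo "$host_addr does not conflict"
    fi
done
```

The same check can confirm that a chosen container range, such as 192.168.127.128/25, lies inside the chosen bridge network.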

2.10. Managing CPU Mitigations

DGX OS Desktop includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.

If your installation of DGX systems incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and thereby increase performance. This capability is available starting with DGX OS Desktop release 4.4.0.

2.10.1. Determining the CPU Mitigation State of the DGX System

If you do not know whether CPU mitigations are enabled or disabled, run the following command:

$ cat /sys/devices/system/cpu/vulnerabilities/* 
  • CPU mitigations are enabled if the output consists of multiple lines prefixed with Mitigation:.

    Example

    KVM: Mitigation: Split huge pages
    Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
    Mitigation: Clear CPU buffers; SMT vulnerable
    Mitigation: PTI
    Mitigation: Speculative Store Bypass disabled via prctl and seccomp
    Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
    Mitigation: Clear CPU buffers; SMT vulnerable
    
  • CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.

    Example

    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
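
Because the output in either state can mix both kinds of line, a count can be easier to read than the raw listing. The following sketch counts the lines of each kind. The helper name mitigation_summary is an assumption, and the match on the capitalized word Vulnerable follows the example output above rather than any documented format.

```shell
#!/bin/sh
# mitigation_summary is a hypothetical helper. It reads the vulnerability
# status lines on stdin and counts lines that contain "Mitigation:" and
# lines that contain "Vulnerable" (capitalized).
mitigation_summary() {
    awk '
        /Mitigation:/ { m++ }
        /Vulnerable/  { v++ }
        END { printf "mitigated=%d vulnerable=%d\n", m, v }'
}

vuln_dir=/sys/devices/system/cpu/vulnerabilities
if [ -d "$vuln_dir" ]; then
    # grep . prefixes each status line with the name of its file.
    grep . "$vuln_dir"/*
    cat "$vuln_dir"/* | mitigation_summary
fi
```

A high vulnerable count with few mitigated lines suggests that the mitigations are disabled.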
    

2.10.2. Disabling CPU Mitigations

CAUTION:
Performing the following instructions will disable the CPU mitigations provided by the DGX OS Desktop software.
  1. Install the nv-mitigations-off package.
    $ sudo apt install nv-mitigations-off -y
  2. Reboot the system.
  3. Verify CPU mitigations are disabled.
    $ cat /sys/devices/system/cpu/vulnerabilities/*
    The output should include several Vulnerable lines. See Determining the CPU Mitigation State of the DGX System for example output.

2.10.3. Re-enabling CPU Mitigations

  1. Remove the nv-mitigations-off package.
    $ sudo apt purge nv-mitigations-off
  2. Reboot the system.
  3. Verify CPU mitigations are enabled.
    $ cat /sys/devices/system/cpu/vulnerabilities/*
    The output should include several Mitigation: lines. See Determining the CPU Mitigation State of the DGX System for example output.

3. Upgrading DGX OS Desktop Software on DGX Station

Updates to DGX OS Desktop software are made available through standard Ubuntu repositories and are available from several sources. You are responsible for upgrading the software on the DGX Station to install the updates from these sources.

Upgrading DGX OS Desktop software can involve upgrading within the same major DGX OS Desktop release or to a new major DGX OS Desktop release.

  • Upgrading within the same major release upgrades between two DGX OS Desktop releases with the same first digit in their release identifiers, for example, from 3.1.6 to 3.1.7. Upgrading within the same major release of DGX OS Desktop upgrades all the packages to the latest version in the repositories for that release.
  • Upgrading to a new major release upgrades between two DGX OS Desktop releases with different first digits in their release identifiers, for example, from DGX OS Desktop 3.1.7 to 4.0.4. Upgrading to a new major DGX OS Desktop release upgrades all the packages to the latest version in the repositories for the new DGX OS Desktop release.
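
The rule above depends only on the first digit of each release identifier. As an illustration, the following sketch classifies an upgrade from two release identifiers. The helper name upgrade_kind and its output strings are assumptions, not part of the DGX OS Desktop software.

```shell
#!/bin/sh
# upgrade_kind is a hypothetical helper. It compares the first component of
# two release identifiers, for example 3.1.6 and 3.1.7.
upgrade_kind() {
    if [ "${1%%.*}" = "${2%%.*}" ]; then
        echo "same-major"
    else
        echo "new-major"
    fi
}

upgrade_kind 3.1.6 3.1.7   # same first digit
upgrade_kind 3.1.7 4.0.4   # different first digit
```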

For details about the available updates, see Available DGX Station Package Updates. These updates may contain important security updates. To protect your DGX Station, keep your system up to date with the latest important security updates. For information about security updates for the Ubuntu OS, see Ubuntu Security Notices.

3.1. Upgrading Within the Same DGX OS Desktop Major Release

Perform this task to upgrade between two DGX OS Desktop releases with the same first digit in their release identifiers, for example, from 3.1.6 to 3.1.7. The upgrade process does not update your package sources. Future updates are obtained from repositories for the current major release.

You can use any of the standard means provided by the Ubuntu Desktop OS to upgrade within the same DGX OS Desktop major release. For examples, see: and

CAUTION:
When you use these means to upgrade software on the DGX Station, you upgrade all software for which updates are available from your configured software sources, including applications that you installed yourself. If you want to prevent an application from being upgraded, you can instruct the Ubuntu package manager to keep the current version. For more information, see Introduction to Holding Packages on the Ubuntu Community Help Wiki.
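
One way to keep the current version of a package is the apt-mark command. The following is a minimal sketch. The package name my-application is hypothetical, and the run wrapper with its DRY_RUN variable is an assumption added so that the commands are only printed until you choose to execute them.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helpers that wrap apt-mark hold and apt-mark unhold.
hold_package()   { run sudo apt-mark hold "$1"; }
unhold_package() { run sudo apt-mark unhold "$1"; }

hold_package my-application      # keep the installed version during upgrades
run apt-mark showhold            # list packages that are currently held
```

Call unhold_package with the same name when you want the package to receive upgrades again.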

3.3. Opting in to DGX OS Desktop Patch Updates

Patch updates are package upgrades to individual components in DGX OS Desktop software that are delivered through the DGX OS Desktop update repository. You must opt in to receive patch updates. When you opt in, any available patch updates are installed. After you opt in, the distribution for DGX OS Desktop patch updates is added to your configured sources from which packages are obtained.

Ensure that the following prerequisites are met:
  • You are logged in to your Ubuntu desktop on the DGX Station as an administrator user.
  • Your DGX Station is upgraded to DGX OS Desktop release .
  1. Download information from all configured sources about the latest versions of the packages.
    $ sudo apt update
  2. Install the dgxstation-release-updates-repo package.
    $ sudo apt install -y dgxstation-release-updates-repo
    release
    The code name of the Ubuntu OS release on which your DGX OS Desktop release is based. For example, if you are running a DGX OS Desktop 4 release, which is based on Ubuntu 18.04, release is bionic.

    This example installs the dgxstation-bionic-updates-repo.

    $ sudo apt install -y dgxstation-bionic-updates-repo
  3. After installing the dgxstation-release-updates-repo package, download information from all configured sources about the latest versions of the packages again.
    $ sudo apt update
  4. Review the available updates by simulating an upgrade of the packages.
    $ sudo apt -s full-upgrade
  5. Install all available updates for your current DGX OS Desktop release.
    $ sudo apt -y full-upgrade
    Note: Even if the R450 repository is enabled, CUDA 11.0 is not automatically installed. To manually install CUDA 11.0, issue the following command:
    $ sudo apt install -y cuda-toolkit-11-0
  6. When the upgrade is complete, restart your DGX Station.

    Any upgrade to the NVIDIA Graphics Drivers for Linux requires a restart.

    If you upgrade the NVIDIA Graphics Drivers for Linux without restarting the DGX Station, running the nvidia-smi command displays an error message.

    $ nvidia-smi
    Failed to initialize NVML: Driver/library version mismatch
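
To check for this condition from a script, you can search the nvidia-smi output for the error text. This is a sketch only. The helper name driver_mismatch is an assumption, and the check of /var/run/reboot-required relies on the usual Ubuntu behavior of creating that file when an installed package requests a restart.

```shell
#!/bin/sh
# driver_mismatch is a hypothetical helper: it succeeds if the text on stdin
# contains the NVML version-mismatch error.
driver_mismatch() {
    grep -q 'Driver/library version mismatch'
}

if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi 2>&1 | driver_mismatch; then
        echo "Driver was upgraded without a restart; restart the DGX Station."
    fi
fi

# Assumption: Ubuntu typically creates this file when a restart is pending.
if [ -f /var/run/reboot-required ]; then
    echo "A restart is pending."
fi
```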
    

3.4. Available DGX Station Package Updates

Updates to DGX Station are made available through standard Ubuntu repositories.

DGX Station is preset to obtain from these repositories updates to the following software:

  • Docker
  • Packages that are exclusive to the DGX Station, including the CUDA Toolkit and CUDA Drivers packages
  • Ubuntu software

For more information about repositories, see Repositories/Ubuntu on the Ubuntu Community Help Wiki.

3.4.1. Updates to Docker and Software Exclusive to the DGX Station

Updates to Docker and to software that is exclusive to the DGX Station, including the CUDA Toolkit and CUDA Drivers packages, are available from a repository maintained by NVIDIA.

CAUTION:
  • Do not obtain updates to the CUDA Toolkit and CUDA Drivers packages from the public CUDA package repository for Ubuntu. Updates from the public repository may be incompatible with the DGX Optimized Frameworks that are available from the NVIDIA® GPU Cloud (NGC) Registry for DGX.
  • Do not obtain updates to Docker from Docker's repositories. NVIDIA Container Runtime for Docker has strict dependencies on the Docker CE version and updates from Docker's repository may cause NVIDIA Container Runtime for Docker to be removed.

The repository maintained by NVIDIA is enabled by default in Ubuntu Software & Updates, Other Software on the DGX Station, as shown in the following screen capture.

Note: Although a Docker repository is also enabled, DGX Station no longer uses this repository to obtain updates to Docker because the repository maintained by NVIDIA takes precedence over the Docker repository.


Screen capture showing the Ubuntu Software & Updates window with the Other Software tab selected

The following distributions are available from this repository:

release-main
Contains major and minor DGX OS Desktop releases.
release-update
Contains patch updates that distribute security updates, fixes to critical issues, and other updates. This distribution is active only if you have opted in to patch updates as explained in Opting in to DGX OS Desktop Patch Updates.

release is the code name of the Ubuntu OS release on which your DGX OS Desktop release is based. For example, if you are running a DGX OS Desktop 4 release, which is based on Ubuntu 18.04, release is bionic.
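
The same code name is substituted into the name of the dgxstation-release-updates-repo package that you install when you opt in to patch updates. The following sketch builds that package name from the code name of the running system. The helper name updates_repo_pkg is an assumption for illustration.

```shell
#!/bin/sh
# updates_repo_pkg is a hypothetical helper that substitutes an Ubuntu code
# name into the package name pattern dgxstation-release-updates-repo.
updates_repo_pkg() {
    echo "dgxstation-$1-updates-repo"
}

# lsb_release -cs prints the code name of the running Ubuntu release.
if command -v lsb_release >/dev/null 2>&1; then
    updates_repo_pkg "$(lsb_release -cs)"
fi
```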

3.4.2. Updates to the Ubuntu Software on the DGX Station

Updates to the Ubuntu software on the DGX Station are available from the Canonical repositories.

The repositories that are enabled by default in Ubuntu Software & Updates, Ubuntu Software on the DGX Station are shown in the following screen capture.



Screen capture showing the Ubuntu Software & Updates window with the Ubuntu Software tab selected

Note:

By default, the DGX Station does not notify you of available updates or automatically install any updates, including important security updates. To minimize the risk to your DGX Station from security vulnerabilities, you must ensure that it is kept up to date with the latest important security updates.

Updates to another LTS base OS version are blocked because they can disrupt the DGX Station software and disable the NVIDIA graphics drivers.

3.7. Updating Software on an Air-Gapped DGX Station System

For security purposes, some installations require that the DGX Station be an air-gapped system. An air-gapped system is not connected to any unsecured networks, such as the public Internet or an unsecured LAN, or to any other computers connected to an unsecured network. The default mechanisms for updating software on the DGX Station and loading container images from the NGC Container Registry require an Internet connection. On an air-gapped system, which is isolated from the Internet, you must provide alternative mechanisms for updating software and loading container images.

3.7.1. Providing DGX Station Software Updates from a Private Repository

The public NVIDIA and Canonical repositories that provide software updates to the DGX Station are Ubuntu repositories. Access to these repositories requires an Internet connection. On an air-gapped system, which is isolated from the Internet, you must provide these updates from a private repository that mirrors the public repositories.

Note: This task applies only to upgrades within the same DGX OS Desktop major release, for example, from 4.5.0 to 4.6.0. It does not apply to upgrades to a new DGX OS Desktop major release, for example, from 3.1.8 to 4.6.0.
  1. Identify the sources corresponding to the public NVIDIA and Canonical repositories that provide updates to the DGX Station software. You can identify these sources from the /etc/apt/sources.list file and the contents of the /etc/apt/sources.list.d/ directory, or by using System Settings, Software & Updates.
  2. Create and maintain a private mirror of the repository sources that you identified in the previous step.
  3. Update the sources that provide updates to the DGX Station to use your private repository mirror instead of the public repositories. For detailed instructions, see , which provides examples for DGX OS Desktop 4 releases.

    To update these sources, modify the /etc/apt/sources.list file and the contents of the /etc/apt/sources.list.d/ directory.

Future upgrades within the same DGX OS Desktop major release will be obtained from your private repository mirror.
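
The first step above, identifying the configured sources, can be sketched from the command line as follows. This is an illustration only. The function name list_active_sources is hypothetical, and the sketch assumes that every source is written in the one-line "deb ..." format.

```shell
# Hypothetical helper: print the active (uncommented) deb and deb-src entries
# found in the given files or directories, without duplicates.
list_active_sources() {
    grep -rhE '^[[:space:]]*deb(-src)?[[:space:]]' "$@" 2>/dev/null | sort -u
}

# Example use on the DGX Station:
#   list_active_sources /etc/apt/sources.list /etc/apt/sources.list.d/
```

The search covers every file in the directory, so entries in backup files that apt itself ignores can also appear in the result.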

3.7.1.1. Creating the Mirror in a DGX OS 4 System

The instructions in this section are to be performed on a system with network access.

Here are the prerequisites:
  • A system running Ubuntu is needed to create the mirror because the procedure uses several Ubuntu tools.
  • You must be logged in to the system installed with Ubuntu OS as an administrator user because this procedure requires sudo privileges.
  • The system must contain enough storage space to replicate the repositories to a file system. The space requirement could be as high as 250 GB.
  • An efficient way to move large amounts of data is needed, for example, shared storage in a DMZ, or portable USB drives that can be brought into the air-gapped area.

    The data will need to be moved to the systems that need to be updated. Make sure that any portable drives are formatted using ext4 or FAT32.
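
The storage prerequisite can be checked before mirroring starts. The following is a minimal sketch. The function name has_free_gb is hypothetical, and the figure of 250 is simply the upper estimate given in the prerequisites.

```shell
# Hypothetical helper: succeed if the file system that holds PATH has at least
# MIN_GB gibibytes available.
has_free_gb() {
    path="$1"
    min_gb="$2"
    # df -Pk prints the available space in 1024-byte blocks in column 4.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example use with the mount point from these instructions:
#   has_free_gb /media/usb/repository 250 || echo "Not enough free space"
```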

  1. Make sure the storage device is attached to the system with network access and identify the mount point of the device. Example mount point used in these instructions: /media/usb/repository
  2. Install the apt-mirror package.
    $ sudo apt update
    $ sudo apt install apt-mirror
  3. Change the ownership of the target directory to the apt-mirror user in the apt-mirror group.
    $ sudo chown apt-mirror:apt-mirror /media/usb/repository

    The target directory must be owned by the user apt-mirror or the replication will not work.

  4. Configure the path of the destination directory in /etc/apt/mirror.list and use the included list of repositories below to retrieve the packages for both Ubuntu base OS and the NVIDIA DGX OS packages.
    ############# config ##################
    #
    set base_path /media/usb/repository #/your/path/here
    #
    # set mirror_path $base_path/mirror
    # set skel_path $base_path/skel
    # set var_path $base_path/var
    # set cleanscript $var_path/clean.sh
    # set defaultarch <running host architecture>
    # set postmirror_script $var_path/postmirror.sh
    set run_postmirror 0
    set nthreads 20
    set _tilde 0
    #
    ############# end config ##############
    # Standard Canonical package repositories:
    deb http://security.ubuntu.com/ubuntu bionic-security main
    deb http://security.ubuntu.com/ubuntu bionic-security universe
    deb http://security.ubuntu.com/ubuntu bionic-security multiverse
    deb http://archive.ubuntu.com/ubuntu/ bionic main multiverse universe
    deb http://archive.ubuntu.com/ubuntu/ bionic-updates main multiverse universe
    #
    deb-i386 http://security.ubuntu.com/ubuntu bionic-security main
    deb-i386 http://security.ubuntu.com/ubuntu bionic-security universe
    deb-i386 http://security.ubuntu.com/ubuntu bionic-security multiverse
    deb-i386 http://archive.ubuntu.com/ubuntu/ bionic main multiverse universe
    deb-i386 http://archive.ubuntu.com/ubuntu/ bionic-updates main multiverse universe
    #
    # DGX specific repositories:
    deb http://international.download.nvidia.com/dgxstation/repos/bionic bionic main restricted universe multiverse
    deb http://international.download.nvidia.com/dgxstation/repos/bionic bionic-updates main restricted universe multiverse
    deb http://international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1 main multiverse restricted universe
    deb http://international.download.nvidia.com/dgxstation/repos/bionic bionic-r450+cuda11.0 main multiverse restricted universe
    #
    deb-i386 http://international.download.nvidia.com/dgxstation/repos/bionic bionic main restricted universe multiverse
    deb-i386 http://international.download.nvidia.com/dgxstation/repos/bionic bionic-updates main restricted universe multiverse
    # Only for DGX OS 4.1.0
    deb-i386 http://international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1 main multiverse restricted universe
    # Clean unused items
    clean http://archive.ubuntu.com/ubuntu
    clean http://security.ubuntu.com/ubuntu
  5. Run apt-mirror and wait for it to finish downloading content.

    This will take a long time depending on the network connection speed.

    $ sudo apt-mirror
  6. Eject the removable storage with all packages.
    $ sudo eject /media/usb/repository 
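
Before the storage device is ejected, it can be worth confirming that apt-mirror created a directory for every repository listed in /etc/apt/mirror.list. The following sketch assumes the default mirror_path of $base_path/mirror, under which each repository is stored by host name and path. The function name check_mirror_tree is hypothetical.

```shell
# Hypothetical helper: for every repository URI in a mirror.list file, check
# that the matching directory exists under BASE_PATH/mirror.
check_mirror_tree() {
    list_file="$1"
    base_path="$2"
    missing=0
    # Field 2 of each deb or deb-ARCH line is the repository URI.
    for rel in $(awk '$1 ~ /^deb(-[a-z0-9]+)?$/ { sub(/^https?:\/\//, "", $2); print $2 }' "$list_file" | sort -u); do
        if [ ! -d "$base_path/mirror/$rel" ]; then
            echo "missing: $base_path/mirror/$rel"
            missing=1
        fi
    done
    return "$missing"
}

# Example use with the paths from these instructions:
#   check_mirror_tree /etc/apt/mirror.list /media/usb/repository
```

The check confirms only that the directories exist. It does not verify that every package was downloaded.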

3.7.1.2. Configuring the Target Air-Gapped DGX OS 4 System

The instructions in this section are to be performed on the target air-gapped DGX system.

Here are the prerequisites:
  • The target air-gapped DGX system is installed, has gone through the first boot process, and is ready to be updated with the latest packages.
  • The USB storage device on which the mirrors were created is attached to the target DGX system.

    There are other ways to transfer the data that are not covered in this document as they will depend on the data center policies for the air-gapped environment.

  1. Mount the storage device on the air-gapped system to /media/usb/repository for consistency.
  2. Configure the apt command to use the file system as the repository in the file /etc/apt/sources.list by modifying the following lines.
    deb file:///media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security main
    deb file:///media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security universe
    deb file:///media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security multiverse
    deb file:///media/usb/repository/mirror/archive.ubuntu.com/ubuntu/ bionic main multiverse universe
    deb file:///media/usb/repository/mirror/archive.ubuntu.com/ubuntu/ bionic-updates main multiverse universe
  3. Configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgx.list.
    deb file:///media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic main multiverse restricted universe
  4. If present, remove the file /etc/apt/sources.list.d/docker.list as it is no longer needed and removing it will eliminate error messages during the update process.
  5. (For DGX OS Release 4.1 and later only) Configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgxstation-bionic-r418-cuda10-1-repo.list.
    $ echo "deb file:///media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic/ bionic-r418+cuda10.1 main multiverse restricted universe" | sudo tee /etc/apt/sources.list.d/dgxstation-bionic-r418-cuda10-1-repo.list
  6. Optional: (For DGX OS Release 4.5 and later only) If you want to use the R450 NVIDIA graphics driver and CUDA Toolkit 11.0, configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgxstation-bionic-r450-cuda11-0-repo.list.
    $ echo "deb file:///media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic/ bionic-r450+cuda11.0 main multiverse restricted universe" | sudo tee /etc/apt/sources.list.d/dgxstation-bionic-r450-cuda11-0-repo.list
    Note: If you want to continue using earlier releases, for example the R418 NVIDIA graphics driver and CUDA Toolkit 10.1, omit this step.
  7. Edit the file /etc/apt/preferences.d/nvidia to update the Pin parameter as follows.
    Package: *
    #Pin: origin international.download.nvidia.com
    Pin: release o=NVIDIA
    Pin-Priority: 600 
  8. Update the apt repository.
    Note: On systems running DGX OS Desktop, some errors will appear because apt-mirror does not handle the @ symbol in URIs. You can ignore these errors because they will not prevent the system from being upgraded.
    $ sudo apt update

    Output from this command is similar to the following example.

    Get:1 file:/media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
    Get:1 file:/media/usb/repository/mirror/security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
    Get:2 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
    Get:2 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
    Get:3 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
    Get:4 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1 InRelease [13.0 kB]
    Get:5 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r450+cuda11.0 InRelease [7070 B]
    Get:5 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r450+cuda11.0 InRelease [7070 B]
    Get:6 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic InRelease [13.1 kB]
    Get:3 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
    Get:4 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1 InRelease [13.0 kB]
    Get:6 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic InRelease [13.1 kB]
    Hit:7 https://download.docker.com/linux/ubuntu bionic InRelease
    Get:8 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1/multiverse amd64 Packages [10.1 kB]
    Get:9 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r450+cuda11.0/multiverse amd64 Packages [17.4 kB]
    Get:10 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1/restricted amd64 Packages [10.3 kB]
    Get:11 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r450+cuda11.0/restricted amd64 Packages [26.4 kB]
    Get:12 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic-r418+cuda10.1/restricted i386 Packages [516 B]
    Get:13 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic/multiverse amd64 Packages [44.5 kB]
    Get:14 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic/multiverse i386 Packages [8,575 B]
    Get:15 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic/restricted i386 Packages [745 B]
    Get:16 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic/restricted amd64 Packages [8,379 B]
    Get:17 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic/universe amd64 Packages [2,946 B]
    Get:18 file:/media/usb/repository/mirror/international.download.nvidia.com/dgxstation/repos/bionic bionic/universe i386 Packages [496 B]
    Reading package lists... Done
    Building dependency tree
    Reading state information... Done
    249 packages can be upgraded. Run 'apt list --upgradable' to see them.
    $ 
  9. Upgrade the system using the newly configured local repositories.
    $ sudo apt full-upgrade

    If you configured apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgxstation-bionic-r450-cuda11-0-repo.list, the NVIDIA graphics driver is upgraded to the R450 driver and the package sources are updated to obtain future updates from the R450 driver repositories.

  10. Optional: (For DGX OS Release 4.5 and later only) If you configured apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgxstation-bionic-r450-cuda11-0-repo.list and want to use CUDA Toolkit 11.0, install it.
    $ sudo apt install cuda-toolkit-11-0
    Note: If you did not configure apt to use the NVIDIA DGX OS packages in the file /etc/apt/sources.list.d/dgxstation-bionic-r450-cuda11-0-repo.list, omit this step. If you try to install CUDA Toolkit 11.0, the attempt fails.
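
The source entries in this procedure all follow one pattern: the http:// prefix of the public URI is replaced by file:// followed by the mirror directory, and the host name and path are kept. The following sketch shows that transformation. The function name to_mirror_sources is hypothetical, and the output should be reviewed before it replaces any file under /etc/apt.

```shell
# Hypothetical helper: rewrite "deb http://host/path ..." entries read from
# standard input to "deb file://MIRROR_ROOT/host/path ...".
# MIRROR_ROOT must not contain the "#" character, which sed uses as delimiter.
to_mirror_sources() {
    mirror_root="$1"
    sed -E "s#^(deb(-src)?[[:space:]]+(\[[^]]*\][[:space:]]+)?)https?://#\1file://${mirror_root}/#"
}

# Example use with the mount point from these instructions:
#   to_mirror_sources /media/usb/repository/mirror < /etc/apt/sources.list
```

Commented lines are left unchanged. Entries for repositories that were not mirrored are rewritten as well, so they still need to be removed separately.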

3.7.1.3. Creating the Mirror in a DGX OS 5 System

The instructions in this section are to be performed on a system with network access.

The following are the prerequisites.
  • A system running Ubuntu is needed to create the mirror because the procedure uses several Ubuntu tools.
  • You must be logged in to the system installed with Ubuntu OS as an administrator user because this procedure requires sudo privileges.
  • The system must contain enough storage space to replicate the repositories to a file system. The space requirement could be as high as 250 GB.
  • An efficient way to move large amounts of data is needed, for example, shared storage in a DMZ, or portable USB drives that can be brought into the air-gapped area.

    The data will need to be moved to the systems that need to be updated. Make sure that any portable drives are formatted using ext4 or FAT32.

  1. Ensure that the storage device is attached to the system with network access and identify the mount point of the device.
    Here is a sample mount point that was used in these instructions:
    /media/usb/repository
  2. Install the apt-mirror package.
    $ sudo apt update
    $ sudo apt install apt-mirror
  3. Change the ownership of the target directory to the apt-mirror user in the apt-mirror group.
    $ sudo chown apt-mirror:apt-mirror /media/usb/repository

    The target directory must be owned by the user apt-mirror or the replication will not work.

  4. Configure the path of the destination directory in /etc/apt/mirror.list and use the included list of repositories below to retrieve the packages for both Ubuntu base OS and the NVIDIA DGX OS packages.
    ############# config ##################
    #
    set base_path /media/usb/repository #/your/path/here
    #
    # set mirror_path $base_path/mirror
    # set skel_path $base_path/skel
    # set var_path $base_path/var
    # set cleanscript $var_path/clean.sh
    # set defaultarch <running host architecture>
    # set postmirror_script $var_path/postmirror.sh
    set run_postmirror 0
    set nthreads 20
    set _tilde 0
    #
    ############# end config ##############
    # Standard Canonical package repositories:
    deb http://security.ubuntu.com/ubuntu focal-security main multiverse universe restricted
    deb http://archive.ubuntu.com/ubuntu/ focal main multiverse universe restricted
    deb http://archive.ubuntu.com/ubuntu/ focal-updates main multiverse universe restricted
    #
    deb-i386 http://security.ubuntu.com/ubuntu focal-security main multiverse universe restricted
    deb-i386 http://archive.ubuntu.com/ubuntu/ focal main multiverse universe restricted
    deb-i386 http://archive.ubuntu.com/ubuntu/ focal-updates main multiverse universe restricted
    #
    # CUDA specific repositories:
    deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /
    #
    # DGX specific repositories:
    deb http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal common dgx
    deb http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
    #
    deb-i386 http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal common dgx
    deb-i386 http://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
    # Clean unused items
    clean http://archive.ubuntu.com/ubuntu
    clean http://security.ubuntu.com/ubuntu
  5. Run apt-mirror and wait for it to finish downloading content.

    This will take a long time depending on the network connection speed.

    $ sudo apt-mirror
  6. Eject the removable storage with all packages.
    $ sudo eject /media/usb/repository 

3.7.1.4. Configuring the Target Air-Gapped DGX OS 5 System

The instructions in this section are to be performed on the target air-gapped DGX system.

The following are the prerequisites.
  • The target air-gapped DGX system is installed, has gone through the first boot process, and is ready to be updated with the latest packages.
  • The USB storage device on which the mirrors were created is attached to the target DGX system.

    There are other ways to transfer the data that are not covered in this document as they will depend on the data center policies for the air-gapped environment.

  1. Mount the storage device on the air-gapped system to /media/usb/repository for consistency.
  2. Configure the apt command to use the file system as the repository in the file /etc/apt/sources.list by modifying the following lines.
    deb file:///media/usb/repository/mirror/security.ubuntu.com/ubuntu focal-security main multiverse universe restricted
    deb file:///media/usb/repository/mirror/archive.ubuntu.com/ubuntu/ focal main multiverse universe restricted
    deb file:///media/usb/repository/mirror/archive.ubuntu.com/ubuntu/ focal-updates main multiverse universe restricted
  3. Configure apt to use the NVIDIA DGX OS packages in the /etc/apt/sources.list.d/dgx.list file.
    deb file:///media/usb/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal common dgx
    deb file:///media/usb/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64/ focal-updates common dgx
    Configure apt to use the NVIDIA CUDA packages in the /etc/apt/sources.list.d/cuda-compute-repo.list file.
    deb file:///media/usb/repository/mirror/developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /
  4. Update the apt repository.
    Note: On systems running DGX OS Desktop, some errors will appear because apt-mirror does not handle the @ symbol in URIs. You can ignore these errors because they will not prevent the system from being upgraded.
    $ sudo apt update

    Output from this command is similar to the following example.

    Get:1 file:/media/usb/repository/mirror/security.ubuntu.com/ubuntu focal-security InRelease [107 kB]
    Get:2 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu focal InRelease [265 kB]
    Get:3 file:/media/usb/repository/mirror/archive.ubuntu.com/ubuntu focal-updates InRelease [111 kB]
    Get:4 file:/media/usb/repository/mirror/developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease
    Get:5 file:/media/usb/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal InRelease [12.5 kB]
    Get:6 file:/media/usb/repository/mirror/repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates InRelease [12.4 kB]
    Get:7 file:/media/usb/repository/mirror/developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Release [697 B]
    Get:8 file:/media/usb/repository/mirror/developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  Release.gpg [836 B]
    Reading package lists... Done
    
  5. Upgrade the system using the newly configured local repositories.
    $ sudo apt full-upgrade
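
On an air-gapped system, every line in the output of apt update should refer to a file: URI. A source that still points at the network shows up as an http, https, or ftp URI instead. The following sketch filters the output for such lines. The function name remote_fetches is hypothetical, and the sketch assumes the "Get:", "Hit:", "Ign:", and "Err:" line prefixes that apt prints.

```shell
# Hypothetical helper: print the lines of "apt update" output that refer to
# network sources rather than file: URIs. Reads the output on standard input.
remote_fetches() {
    # grep exits with status 1 when nothing matches, so mask that status.
    grep -E '^(Get|Hit|Ign|Err):[0-9]+ (https?|ftp)://' || true
}

# Example use on the air-gapped system:
#   sudo apt update 2>&1 | remote_fetches
```

Empty output indicates that all configured sources resolve to the local mirror.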

3.7.2. Loading a Container Image onto an Air-Gapped DGX Station System

Loading a container image from the NGC Container Registry requires an Internet connection. On an air-gapped system, which is isolated from the Internet, you must use a removable medium to copy the container image from a system with an Internet connection to the air-gapped system.

  1. On a system with an Internet connection, log in to the NGC Container Registry and load the container image that you want. For instructions, refer to NGC Container Registry for DGX User Guide.
  2. Save the container image as a tar archive.
    $ docker save nvcr.io/registry-space/repository:tag > archive-file.tar
    registry-space
    The name of the space within the registry that contains the container image. For container images provided by NVIDIA, the registry space is nvidia.
    repository
    The repository that contains the container image. A repository is a collection of all versions of a container image with the same name. The repository name is the main container image name.
    tag
    A tag that identifies the version of the container image.
    archive-file
    Your choice of name for the archive file to which you are saving the container image.
  3. Transfer the image to the air-gapped system by using a removable medium such as a USB flash drive or DVD-ROM.
  4. On the air-gapped system, load the container image from the local copy of the archive file that contains the image.
    $ docker load -i archive-file.tar
  5. Confirm that the image is loaded on the air-gapped system.
    $ docker images
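
Because the archive is carried on a removable medium, you may want to confirm that it arrived intact before loading it. The following sketch records a SHA-256 checksum next to the archive and verifies it after the transfer. The function names and the /media/usb path in the example are hypothetical, and the sketch assumes that the sha256sum utility from GNU coreutils is available on both systems.

```shell
# Hypothetical helpers: record and verify a SHA-256 checksum for an archive.
write_checksum() {
    # Work in the archive's directory so that the recorded name is relative.
    (cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256")
}

verify_checksum() {
    (cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256")
}

# Example use:
#   write_checksum archive-file.tar               # on the connected system
#   verify_checksum /media/usb/archive-file.tar   # on the air-gapped system
```

Copy the .sha256 file to the removable medium together with the archive so that verify_checksum can find it.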

4. Maintaining and Servicing the NVIDIA DGX Station

Be sure to familiarize yourself with the NVIDIA Terms & Conditions documents before attempting to perform any modification or repair to the DGX Station. These Terms & Conditions for the DGX Station can be found through the NVIDIA DGX Systems Support page.

CAUTION:
The DGX Station is designed as an integrated system and does not support the installation of additional PCIe devices such as GPU cards. Any attempt to modify the DGX Station by installing additional PCIe devices is an unauthorized modification and will void the DGX Station hardware warranty. Any such modification will also impair the performance of the system, may overload the system’s electrical circuits, and may cause it to overheat.

4.1. Problem Resolution and Customer Care

Log on to the NVIDIA Enterprise Support site for assistance with troubleshooting, diagnostics, or to report problems with your DGX Station.

Refer to Customer Support for the NVIDIA DGX Station for additional contact information.

4.2. Cleaning the Mesh Filter Under the DGX Station

To prevent dust from entering the DGX Station through the ventilation holes under the unit, a mesh filter is fitted to the underside of the DGX Station. Clean this mesh filter periodically to prevent the accumulation of dust on the filter from impeding the flow of air through the DGX Station.

  1. Reach under the front of the DGX Station and grasp the mesh filter by its handle.
  2. Pull the mesh filter towards you to slide it out from the front of the unit.

    Line drawing showing the mesh filter being pulled out from underneath the DGX Station.

  3. Use compressed air to blow the dust from the mesh filter.
  4. Line up the mesh filter with the runners under the DGX Station and slide it back into position under the unit.

    Line drawing showing the mesh filter being slid back underneath the DGX Station.

4.3. Since DGX OS Desktop 4.4.0: Checking the Health of and Collecting Troubleshooting Information for the DGX Station

Starting with release 4.4.0, use NVIDIA System Management (NVSM) to check the health of and collect troubleshooting information for the DGX Station.

For information about how to use NVSM to perform these tasks, see the relevant instructions in the NVSM documentation.

Task Instructions
Checking the health of the DGX Station Show Health in NVIDIA System Management User Guide
Collecting troubleshooting information for the DGX Station Dump Health in NVIDIA System Management User Guide

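For information about how to perform these tasks for earlier releases, see the following topics:

  • DGX OS Desktop 4.3.0 and Earlier: Collecting Information for Troubleshooting the DGX Station
  • DGX OS Desktop 4.3.0 and Earlier: Checking the Health of the DGX Station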

4.4. DGX OS Desktop 4.3.0 and Earlier: Collecting Information for Troubleshooting the DGX Station

Note: Starting with release 4.4.0, the tool to collect troubleshooting information (nvsysinfo) is replaced by NVIDIA System Management (NVSM). For information about how to use NVSM to perform this task, see Dump Health in NVIDIA System Management User Guide.

To help diagnose and resolve issues, the DGX Station provides a tool to collect troubleshooting information for NVIDIA Support Enterprise Services.

The tool verifies basic functionality and performance of the DGX Station and collects the following information in an xz-compressed tar archive:

  • Log files
  • Hardware inventory
  • Software inventory

To collect information for troubleshooting the DGX Station, run the following command:

$ sudo nvsysinfo [-o output-file]
Note: For DGX OS Desktop releases 3.1.1 through 3.1.3, the command to run is as follows:
$ sudo nvidia-sysinfo [-o output-file]
output-file

The path of the file in which the information is written.

If you omit the output file, the name of the file to which the information is written depends on the release of DGX OS Desktop that you are using.

DGX OS Desktop Release File Name
Releases 4.0.4 through 4.3.0 /tmp/nvsysinfo-host-name-timestamp.tar.xz
Any 3.x release since 3.1.4 /tmp/nvsysinfo-timestamp.random-number.out
Releases 3.1.1 through 3.1.3 /tmp/nvidia-sys-info-timestamp.random-number.out

Use any method that is convenient for you to send the file to NVIDIA Support Enterprise Services. For example, send the file as an e-mail attachment.

4.5. DGX OS Desktop 4.3.0 and Earlier: Checking the Health of the DGX Station

Note: Starting with release 4.4.0, the NVIDIA System Health Checker (nvhealth) tool is replaced by NVIDIA System Management (NVSM). For information about how to use NVSM to perform this task, see Show Health in NVIDIA System Management User Guide.

The DGX Station provides the NVIDIA System Health Checker (nvhealth) tool to exercise the system and verify its health. The output of nvhealth is an itemized list of checks and their status, typically Healthy or Unhealthy. On a healthy system, all checks should return Healthy. You should investigate any checks that return Unhealthy to determine their root cause and resolve them.

To check the health of the DGX Station, run the following command:

$ sudo nvhealth [-k output-file]
output-file

The name and the path of the file in which the raw state of the system is written. The nvhealth command displays this file name at the end of the output from the command.

If you omit the output file, the information is written to the file /tmp/nvhealth-log.random-string.jsonl, for example, /tmp/nvhealth-log.6wf3WriAC3.jsonl.

Note:

If you run the nvhealth command while the RAID array is being rebuilt after a change in RAID level to RAID 5, nvhealth reports the status of the RAID volume as unhealthy. To avoid this potentially misleading result, wait until the RAID array is rebuilt before running nvhealth.

To check the progress of the rebuild and show the percentage complete and an estimate of the time to completion, run this command:

# cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sdb[0] sdc[1] sdd[2]
     181764096 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
     [===>.................]  recovery = 17.2% (10426232/60588032) finish=45.8min speed=18238K/sec
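
If you want to wait for the rebuild from a script, the percentage can be extracted from this output. The following is a minimal sketch. The function name raid_rebuild_progress is hypothetical, and the sketch assumes the "recovery = NN.N%" layout shown in the example above.

```shell
# Hypothetical helper: print the completion percentage of any md recovery or
# resync that is in progress. Reads /proc/mdstat unless a file is given.
raid_rebuild_progress() {
    awk '/(recovery|resync) *=/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i }' "${1:-/proc/mdstat}"
}

# Example use:
#   raid_rebuild_progress
```

The function prints nothing when no rebuild is in progress, which is the point at which nvhealth can be run.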

4.6. Replacing the System and Components

Be sure to familiarize yourself with the NVIDIA Terms & Conditions documents before attempting to perform any modification or repair to the DGX Station. These Terms & Conditions for the DGX Station can be found through the NVIDIA DGX Systems Support page.

Contact NVIDIA Enterprise Support to obtain an RMA number for any system or component that needs to be returned for repair or replacement. When replacing a component, use only the replacement supplied to you by NVIDIA unless you are directed otherwise.

The following components are customer-replaceable:
  • Solid State Drives (SSDs)
    Note: If you want to add SSDs for data storage to the DGX Station, obtain the SSDs from NVIDIA Enterprise Support.
  • DIMMs
    Note: DIMMs are customer replaceable if a DIMM fails or to increase the system memory capacity to 512 GB. If you want to increase the system memory capacity to 512 GB, obtain the replacement DIMMs from NVIDIA Enterprise Support.
  • CMOS power cell
    Note: Obtain the replacement CMOS power cell yourself, not from NVIDIA Enterprise Support.

Return failed high-value components to NVIDIA. You do not need to return functional 32-GB DIMMs or low-cost items such as CMOS power cells.

4.6.1. Replacing the System

When returning a DGX Station under RMA, consider the following points.

Packaging

To prevent damage during shipping, repack the DGX Station in the packaging in which the replacement unit was shipped to you in advance. Follow the instructions in Repacking the DGX Station for Shipment.

SSDs

If necessary, you can remove and keep the SSDs prior to shipping the system back for replacement. If you already received a replacement system and you want to keep the original SSDs, install the new SSDs into the defective system when shipping it back.

AC Power Cable

Do not return the AC power cable when returning the DGX Station.

Accessories

Include all supplied accessories except the AC power cable when returning the DGX Station.

4.6.2. Repacking the DGX Station for Shipment

To prevent damage during shipment, repack a DGX Station that you are returning to NVIDIA under an RMA in the packaging in which the replacement unit was shipped to you in advance.

CAUTION:
The DGX Station weighs 88 lbs (40 kg). Do not attempt to lift the DGX Station. Instead, move it into position by rolling it on its fitted casters.
Before you begin, ensure that the foam packing piece that surrounds the GPU cards inside the DGX Station has been replaced. For detailed instructions, see Removing or Replacing the Packing Inside the DGX Station.
  1. Place the bottom tray of the DGX Station shipping carton on the floor and ensure that the flap at the front of the tray is pulled down to form a ramp.
  2. Roll the DGX Station up the ramp into the bottom tray of its shipping carton.
    CAUTION:
    Ensure that you have a second person to help you roll the DGX Station into position.


    Line drawing showing the DGX Station being rolled into the bottom tray of its shipping carton.

  3. Insert the front packing piece into the tray, ensuring that the lip of the packing piece is under the DGX Station.

    Line drawing showing the front packing piece being inserted into the bottom tray of the DGX Station shipping carton.

  4. Insert the side packing pieces into the tray, ensuring that the lip of each piece is under the DGX Station.

    Line drawing showing side packing pieces being inserted into the bottom tray of the DGX Station shipping carton.

  5. Pack all supplied accessories in the accessory boxes except the AC power cable. Keep the AC power cable to use with your replacement DGX Station.
  6. Place both accessory boxes in the slots in the tray on each side of the DGX Station.

    Ensure that the lugs that protrude from the edges of each accessory box are facing away from the DGX Station.



    Line drawing showing accessory boxes being placed in the slots in the bottom tray of the DGX Station shipping carton.

    The accessory boxes are required to help hold the DGX Station in place in its packaging during shipment. Be sure to place both accessory boxes in the slots in the tray, even if one or both boxes are empty.

  7. Pull up the flap at the front of the bottom tray of the DGX Station shipping carton.

    Line drawing showing the flap at the front of the bottom tray of the DGX Station shipping carton being pulled up.

  8. Lower the top cover of the shipping carton into position so that the holes in the top cover and the holes in the bottom tray are aligned.

    Line drawing showing the top cover of the DGX Station shipping carton being lowered into position.

  9. Insert the packing clasps into the cutouts in the top cover of the shipping carton and engage the clasps to secure the top cover in place.

    To prevent the packing clasps from becoming jammed inside the shipping carton, do not use excessive force when inserting them into the cutouts.



    Line drawing showing the DGX Station packing clasps being replaced.

4.6.3. Replacing a DIMM

You can replace a dual inline memory module (DIMM) if a DIMM fails or if you want to replace all eight factory-installed 32-GB DIMMs with 64-GB DIMMs to give a total capacity of 512 GB.

Before attempting to replace a faulty DIMM, contact NVIDIA Enterprise Customer support for help in determining the location ID of the faulty DIMM that needs replacement.

The location ID is one of the following alphanumeric designators:

  • DIMM_A1
  • DIMM_A2
  • DIMM_B1
  • DIMM_B2
  • DIMM_D2
  • DIMM_D1
  • DIMM_C2
  • DIMM_C1
CAUTION:
The components inside the DGX Station are static-sensitive devices. Protect these devices against electrostatic discharge (ESD) by wearing a wrist strap connected to the DGX Station chassis ground and placing components on static-free work surfaces.

The DIMMs are located on the motherboard inside the DGX Station.

  1. Turn off the DGX Station and disconnect the network and power cables.
  2. Remove the side panel on the right of the DGX Station when viewed from the rear.
    1. Push the button on the right side of the DGX Station back panel to release the panel.

      Line drawing showing the button on the right side of the DGX Station back panel being pushed.

    2. Lift the panel to remove it.

      Line drawing showing the DGX Station side-panel being removed.

      CAUTION:
      To prevent damage from electrostatic discharge, avoid touching any of the components inside the DGX Station other than any components that you are replacing or servicing.
  3. If you are replacing a faulty DIMM, use the following figure as a guide to locate the faulty DIMM.

    Diagram showing the DIMM socket locations on the DGX Station motherboard.

  4. Remove the DIMM.

    If you are replacing 32-GB DIMMs with 64-GB DIMMs to increase the system memory capacity, remove all eight 32-GB DIMMs before fitting the replacement 64-GB DIMMs.



    Diagram showing removal of a DIMM from its socket.

    1. Press upwards on the latch at the upper end of the DIMM socket to open the latch and unseat the DIMM from the socket.
    2. Pull the DIMM towards you to remove it from the socket.
  5. Carefully insert the replacement DIMM. If you are replacing 32-GB DIMMs with 64-GB DIMMs to increase the system memory capacity, perform this step for each replacement 64-GB DIMM.

    Diagram showing insertion of a DIMM into its socket.

    1. Make sure the socket latch is open.
    2. Position the replacement DIMM over the socket, making sure that the notch on the DIMM lines up with the key in the slot, then press the DIMM into the socket until the latch clicks into place. When the DIMM is correctly seated, the latch should be closed as shown in the following figure.

      Diagram showing a DIMM after it is correctly seated with its socket latch closed.

  6. Replace the side panel of the DGX Station.
    1. Align the bottom edge of the side panel with the bottom edge of the DGX Station.

      Line drawing showing the side-panel being aligned with the bottom edge of the DGX Station.

    2. Firmly push the panel back into place to re-engage the latch.

      Line drawing showing the DGX Station side-panel latch being re-engaged.

  7. Reconnect the network and power cables and power on the DGX Station. The DGX Station stops at its power-on self-test (POST) and then shuts down.
  8. After the DGX Station shuts down, power it on again. When powered on a second time, the DGX Station starts up normally.

4.6.4. Replacing the CMOS Power Cell in the DGX Station

The CMOS power cell in the DGX Station provides power to the Real Time Clock (RTC) to maintain BIOS settings such as the system time and date while the DGX Station is disconnected from the AC power supply. If the DGX Station is restarted after being disconnected from the AC power supply and the CMOS power cell is discharged, you are warned that the RTC has been reset and that the settings must be updated in the BIOS. To avoid these warnings, replace the CMOS power cell in the DGX Station.

Warning: If the date is reset or you are prompted to press F1 to select a boot option each time you boot your DGX system, you must replace the battery.
CAUTION:
The components inside the DGX Station are static-sensitive devices. Protect these devices against electrostatic discharge (ESD) by wearing a wrist strap connected to the DGX Station chassis ground and placing components on static-free work surfaces.

To complete this task, you need the following tools and materials:

  • 1 small flat head screwdriver
  • 1 fresh CR2032 power cell
  1. Turn off the DGX Station and disconnect the network and power cables.
  2. Remove the side panel on the right of the DGX Station when viewed from the rear.
    1. Push the button on the right side of the DGX Station back panel to release the panel.

      Line drawing showing the button on the right side of the DGX Station back panel being pushed.

    2. Lift the panel to remove it.

      Line drawing showing the DGX Station side-panel being removed.

      CAUTION:
      To prevent damage from electrostatic discharge, avoid touching any of the components inside the DGX Station other than any components that you are replacing or servicing.
  3. Remove the old CMOS power cell.

    The CMOS power cell is located in the top left corner of the motherboard inside the DGX Station.



    Diagram showing the location of the CMOS power cell on the DGX Station motherboard.

    1. Carefully insert the blade of the small flat head screwdriver between the motherboard and the CMOS power cell.
    2. Use the small flat head screwdriver to pry the CMOS power cell from the motherboard.
    Warning: Do not dispose of the old CMOS power cell in municipal waste.
  4. Carefully align the replacement CR2032 CMOS power cell in the receptacle on the motherboard with the + sign facing you and press it into position.
  5. Replace the side panel of the DGX Station.
    1. Align the bottom edge of the side panel with the bottom edge of the DGX Station.

      Line drawing showing the side-panel being aligned with the bottom edge of the DGX Station.

    2. Firmly push the panel back into place to re-engage the latch.

      Line drawing showing the DGX Station side-panel latch being re-engaged.

  6. Reconnect the network and power cables and power on the DGX Station. The DGX Station stops at its power-on self-test (POST) and then shuts down.
  7. After the DGX Station shuts down, power it on again. When powered on a second time, the DGX Station starts up normally.
  8. If necessary, set the system date and system time to the current time and date.
    1. At the first NVIDIA screen to appear while the system is rebooting, press F2 to access the UEFI BIOS Utility - EZ Mode screen.
    2. Click the date and time displayed in the top left corner of the UEFI BIOS Utility - EZ Mode screen.

      Screen capture showing the location of the date and time in the UEFI BIOS Utility - EZ Mode screen.

    3. In the System Date & Time setting screen that pops up, fill in the current date and time and click Save.

      Screen capture showing the System Date & Time setting pop-up screen.

    4. Press F10 and, when prompted, select OK to save your changes and exit.

4.7. Maintaining the DGX Station Persistent Storage

The DGX Station persistent storage consists of SSDs for data storage and the operating system. As supplied from the factory, these SSDs are configured as described in System Memory and Storage.

4.7.1. Changing the RAID Level of the RAID Array

As supplied from the factory, the RAID level of the DGX Station RAID array is RAID 0. RAID 0 provides the maximum storage capacity, but does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost. If you are willing to accept reduced capacity in return for some level of protection against failure of a single SSD, you can change the level of the RAID array to RAID 5. If you change the RAID level from RAID 0 to RAID 5, the total storage capacity of the RAID array is reduced from 5.76 TB to 3.84 TB.
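
The capacity figures above follow from ordinary RAID arithmetic: RAID 0 stripes data across every SSD, while RAID 5 gives up the equivalent of one SSD to parity. The following Python sketch reproduces the numbers for three 1.92-TB data SSDs. The function name `usable_capacity_gb` is illustrative only and is not part of the DGX Station software.

```python
def usable_capacity_gb(ssd_count: int, ssd_size_gb: int, raid_level: int) -> int:
    """Return the nominal usable capacity, in GB, of an array of equal-sized SSDs."""
    if raid_level == 0:
        # RAID 0 stripes data across every SSD, so no capacity is given up.
        return ssd_count * ssd_size_gb
    if raid_level == 5:
        if ssd_count < 3:
            raise ValueError("this sketch assumes at least three SSDs for RAID 5")
        # RAID 5 uses the equivalent of one SSD for parity.
        return (ssd_count - 1) * ssd_size_gb
    raise ValueError(f"unsupported RAID level: {raid_level}")


# Three 1.92-TB (1920-GB) data SSDs.
print(usable_capacity_gb(3, 1920, 0))  # 5760 GB, that is, 5.76 TB
print(usable_capacity_gb(3, 1920, 5))  # 3840 GB, that is, 3.84 TB
```

These are nominal decimal capacities. Tools such as lsblk report sizes in binary units, so the values that they display are somewhat smaller.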

Before changing the RAID level of the DGX Station RAID array, back up all data on the array that you want to preserve. Changing the RAID level of the DGX Station RAID array erases all data stored on the array.

The DGX Station software includes the custom script configure_raid_array.py, which you can use to change the level of the RAID array without unmounting the RAID volume.

  • To change the RAID level to RAID 5, run the following command:

    $ sudo configure_raid_array.py -m raid5
    Note:

    After you change the RAID level to RAID 5, the RAID array is rebuilt. A RAID array that is being rebuilt is online and ready to be used, but a check on the health of the DGX Station reports the status of the RAID volume as unhealthy. Therefore, avoid checking the health of the DGX Station while the RAID array is being rebuilt. For more information, see DGX OS Desktop 4.3.0 and Earlier: Checking the Health of the DGX Station.

    The time required to rebuild the RAID array depends on the workload on the system. On an idle system, the rebuild might be complete within 30 minutes.

  • To change the RAID level to RAID 0, run the following command:

    $ sudo configure_raid_array.py -m raid0

To confirm that the RAID level was changed as required, run the lsblk command. The entry in the TYPE column for the RAID volume that is listed under each SSD in the array indicates the RAID level of the array.

The following example shows that the RAID level of the array is RAID 0. The name of the RAID volume is md0 and the mount point of the volume is /raid.

~$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda      8:0    0  1.8T  0 disk
|_sda1   8:1    0  487M  0 part  /boot/efi
|_sda2   8:2    0  1.8T  0 part  /
sdb      8:16   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
sdc      8:32   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
sdd      8:48   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
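
If you want to script this check, the TYPE entry can be read from the lsblk output. The following Python sketch is one possible approach, not an NVIDIA-supplied tool. It assumes the plain column layout shown in the preceding example, and the function name `raid_level_from_lsblk` and the sample text are illustrative.

```python
SAMPLE_LSBLK = """\
NAME   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda      8:0    0  1.8T  0 disk
|_sda1   8:1    0  487M  0 part  /boot/efi
|_sda2   8:2    0  1.8T  0 part  /
sdb      8:16   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
sdc      8:32   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
sdd      8:48   0  1.8T  0 disk
|_md0    9:0    0  5.2T  0 raid0 /raid
"""


def raid_level_from_lsblk(lsblk_text: str, volume: str = "md0") -> str:
    """Return the TYPE entry reported for the RAID volume in plain lsblk output."""
    types = set()
    for line in lsblk_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        # Strip the tree-drawing prefix that precedes child devices.
        name = fields[0].lstrip("|_`-└├─")
        if name == volume:
            # Columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
            types.add(fields[5])
    if len(types) != 1:
        raise ValueError(f"expected one TYPE for {volume}, found {sorted(types)}")
    return types.pop()


print(raid_level_from_lsblk(SAMPLE_LSBLK))  # raid0
```

On a live system, you could pass the captured output of the lsblk command to the function instead of the sample text.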

4.7.2. Checking the Status of the DGX Station RAID Array

Use the mdadm command to print details of the md0 device.

$ sudo mdadm -D /dev/md0

This example shows the status of a RAID array that is functioning properly.

$ sudo mdadm -D /dev/md0
        Version : 1.2
  Creation Time : Mon Jun  5 17:40:48 2017
     Raid Level : raid0
     Array Size : 374964224 (357.59 GiB 383.96 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

    Update Time : Mon Jun  5 17:40:48 2017
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 512K
           Name : lab-VirtualBox:0  (local to host lab-VirtualBox)
           UUID : c8ba911a:8634bd99:2ebeea3d:c9a7db4c
         Events : 0

    Number   Major   Minor      RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd

This example shows the status of a RAID array in which one SSD has failed or is missing. In this example, the failed or missing SSD is identified by the absence of a device path at the end of its row in the device table.

$ sudo mdadm -D /dev/md0
 ...

    Number   Major   Minor      RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync
       2       8       48        2      active sync   /dev/sdd
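
The summary fields in the mdadm output can also be checked from a script. The following Python sketch is a minimal example under stated assumptions: the field names are those shown in the example output above, the function names `summarize_mdadm_detail` and `looks_healthy` are illustrative, and the health rule (no degraded or FAILED state and zero failed devices) is a simplification and not an official health check.

```python
import re

SAMPLE_DETAIL = """\
     Raid Level : raid0
   Raid Devices : 3
  Total Devices : 3
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
"""


def summarize_mdadm_detail(detail_text: str) -> dict:
    """Extract selected 'Name : value' fields from the output of mdadm -D."""
    summary = {}
    for key in ("Raid Level", "State", "Active Devices", "Failed Devices"):
        pattern = rf"^\s*{re.escape(key)}\s*:\s*(.+?)\s*$"
        match = re.search(pattern, detail_text, re.MULTILINE)
        if match:
            summary[key] = match.group(1)
    return summary


def looks_healthy(summary: dict) -> bool:
    """Apply a simplified rule: not degraded, not failed, no failed devices."""
    state_words = summary.get("State", "").replace(",", " ").split()
    return (
        "degraded" not in state_words
        and "FAILED" not in state_words
        and summary.get("Failed Devices") == "0"
    )


summary = summarize_mdadm_detail(SAMPLE_DETAIL)
print(summary["State"], looks_healthy(summary))
```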

4.7.3. Checking the Status of the DGX Station SSDs

LEDs on the DGX Station SSDs indicate the status of the SSDs. The SSDs are mounted inside the DGX Station and are visible only when the side panel that covers the SSDs is removed.

  1. Remove the side panel on the left of the DGX Station when viewed from the rear.
    1. Push the button on the left side of the DGX Station back panel to release the panel.
    2. Lift the panel to remove it.
      CAUTION:
      To prevent damage from electrostatic discharge, avoid touching any of the components inside the DGX Station other than any components that you are replacing or servicing.
  2. Examine each SSD to determine its status from the state of the LED on the SSD.

    Line drawing showing the LEDs and identifiers of the DGX Station SSDs.

    On (steady)
    The SSD is operational but is idle.
    On (blinking)
    The SSD is being read from or written to.
    Off
    The SSD has failed and must be replaced.
  3. Replace the side panel of the DGX Station.
    1. Align the bottom edge of the side panel with the bottom edge of the DGX Station.
    2. Firmly push the panel back into place to re-engage the latches.
If an SSD has failed, you must replace it as explained in Adding or Replacing an SSD.

4.7.4. Adding or Replacing an SSD

If you want to increase the capacity of the DGX Station RAID array, you can add four SSDs to the empty drive bays in the DGX Station. If an SSD in the DGX Station fails, replace the SSD to return the system to operation.

CAUTION:
The default RAID level of the array in the DGX Station is RAID 0, which does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost. To prevent the failure of an SSD from causing a loss of data, ensure that any data on the array that you want to preserve is backed up.

If you are adding SSDs to the DGX Station, ensure that the following prerequisites are met:

  • Your DGX Station is running one of the following software releases:

    • DGX OS Desktop 4.4.0 or later
    • DGX software for Red Hat Enterprise Linux or CentOS EL7-20.02 or later
  • You obtained the SSDs from NVIDIA Enterprise Support.

    SSDs obtained from NVIDIA Enterprise Support are qualified for use with the DGX Station and are supplied with the necessary screws to secure them to their drive trays.

  1. Remove the side panel on the left of the DGX Station when viewed from the rear.
    1. Push the button on the left side of the DGX Station back panel to release the panel.
    2. Lift the panel to remove it.
      CAUTION:
      To prevent damage from electrostatic discharge, avoid touching any of the components inside the DGX Station other than any components that you are replacing or servicing.
  2. On the drive tray in which you want to install the new SSD or that contains the SSD that you want to replace, press the drive-tray eject button to loosen the drive-tray latch.

    Line drawing showing the drive-tray eject button being pressed downwards

  3. Pull the drive-tray latch upwards to unseat the drive tray.

    Line drawing showing the drive-tray latch being pulled upwards

  4. Slide the drive tray upwards to completely remove it from the unit.

    Line drawing showing the drive-tray being slid upwards

  5. If you are replacing an SSD, remove the failed SSD from the drive tray.
    1. Using a Phillips screwdriver, remove the four screws attaching the SSD to the drive tray.

      Line drawing showing the screws being removed from the drive-tray

      Save the screws for the replacement SSD.

    2. Slide the SSD out of the drive tray.
  6. Slide the new or replacement SSD into the drive tray.

    Make sure that the connector is on the open edge side of the tray.



    Line drawing showing the SSD being slid into the drive-tray

  7. Secure the new or replacement SSD to the drive tray using the four screws that were supplied with the new SSD or that secured the failed SSD.
  8. With the drive-tray eject button at the right, insert the drive tray into the appropriate drive bay, then slide the drive tray all the way into the drive bay.

    Line drawing showing the drive-tray being slid into the drive bay

  9. Press the drive-tray latch downwards until you hear a click to completely seat the drive tray.

    Line drawing showing the drive-tray latch being pressed downwards

  10. Replace the side panel of the DGX Station.
    1. Align the bottom edge of the side panel with the bottom edge of the DGX Station.
    2. Firmly push the panel back into place to re-engage the latches.

What you need to do to return the DGX Station to service depends on whether you replaced an SSD in the RAID array or the OS SSD.

4.7.5. Rebuilding or Re-Creating the DGX Station RAID Array

Failure of a single drive in a RAID 5 array is a recoverable error but the failure causes data redundancy for the array to be lost. After replacing a single failed SSD in a RAID 5 array, you must rebuild the array to restore data redundancy for the array. Failure of any number of SSDs in a RAID 0 array and failure of more than one SSD in a RAID 5 array are both unrecoverable failures. After replacing the SSDs in response to an unrecoverable failure, you must re-create the array.

If the DGX Station RAID array is degraded because one or more SSDs failed, replace each failed SSD as explained in Adding or Replacing an SSD.

The DGX Station software includes the custom script configure_raid_array.py for rebuilding or re-creating the RAID array.

  • To rebuild a RAID 5 array after replacing a single failed SSD, run the following command:

    $ sudo configure_raid_array.py -r
    Note: The time required to rebuild a RAID 5 array depends on factors such as system load, SSD capacity, and the number of SSDs in the array. Rebuilding the array of three 1.92-terabyte SSDs in the DGX Station may require several hours.

    You can monitor the progress of a long-running rebuild by examining the contents of the /proc/mdstat file:

    $ cat /proc/mdstat
    Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
    md0 : active raid5 sdb[0] sdd[3] sdc[1]
          3750486016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
          [>....................]  recovery =  4.0% (75580956/1875243008) finish=438.3min speed=68419K/sec
          bitmap: 2/14 pages [8KB], 65536KB chunk
    
    unused devices: <none>

    In this example, the rebuild is 4.0% complete and is estimated to finish in 438.3 minutes.

  • To re-create a RAID 5 array after replacing more than one failed SSD, run the following command:

    $ sudo configure_raid_array.py -c -5 -f
    CAUTION:
    Specify the -c option only if an unrecoverable failure, such as the failure of more than one SSD, has occurred. The -c option erases all data in the array.
  • To re-create a RAID 0 array after replacing any number of failed SSDs, run the following command:

    $ sudo configure_raid_array.py -c -f
The RAID array is rebuilt or re-created with the RAID level that you specified.
  • If you re-created a RAID 0 or RAID 5 array, all data that was on the array is erased after the array is re-created.
  • If you rebuilt a RAID 5 array, the data on the array is preserved after the array is rebuilt.
If you have re-created a RAID 0 or RAID 5 array and have a backup of data on the array that you want to preserve, restore the data from the backup.
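
If you want to track a long-running RAID 5 rebuild from a script instead of reading /proc/mdstat by eye, the progress line can be parsed. The following Python sketch assumes the `recovery = N% ... finish=Nmin` format shown in the /proc/mdstat example in this section. The function name `recovery_progress` is illustrative, and the sketch does not handle other operations, such as a resync or reshape, which are reported with different keywords.

```python
import re

SAMPLE_MDSTAT = """\
Personalities : [raid0] [linear] [multipath] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdb[0] sdd[3] sdc[1]
      3750486016 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
      [>....................]  recovery =  4.0% (75580956/1875243008) finish=438.3min speed=68419K/sec
      bitmap: 2/14 pages [8KB], 65536KB chunk

unused devices: <none>
"""


def recovery_progress(mdstat_text: str):
    """Return (percent_complete, minutes_remaining), or None if no recovery line is found."""
    match = re.search(r"recovery\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", mdstat_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(recovery_progress(SAMPLE_MDSTAT))  # (4.0, 438.3)
```

On a live system, you could read the text with `open("/proc/mdstat").read()` and call the function periodically until it returns `None`.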

4.7.6. Expanding the DGX Station RAID Array

After adding SSDs to the DGX Station, you must expand the RAID array to add the new SSDs to the array. The procedure for expanding the RAID array is the same for all supported RAID levels.

Add the extra SSDs to the DGX Station as explained in Adding or Replacing an SSD.

Because expanding a RAID array risks loss of data, ensure that you have a backup of data on the array that you want to preserve.

This task requires sudo privileges.

Use standard Linux operating system commands to expand the DGX Station RAID array.

  1. Obtain the device IDs of the SSDs that were added by searching for 1.8 T drives that aren’t mounted in the output from the lsblk command.
    $ lsblk

    In the following example, the device IDs of the SSDs that were added are sde, sdf, sdg, and sdh.

    $ lsblk
    NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
    sda      8:0    0   1.8T  0 disk
    |_sda1   8:1    0   487M  0 part  /boot/efi
    |_sda2   8:2    0   1.8T  0 part  /
    sdb      8:16   0   1.8T  0 disk
    |_md0    9:0    0   5.2T  0 raid0 /raid
    sdc      8:32   0   1.8T  0 disk
    |_md0    9:0    0   5.2T  0 raid0 /raid
    sdd      8:48   0   1.8T  0 disk
    |_md0    9:0    0   5.2T  0 raid0 /raid
    sde      8:64   0   1.8T  0 disk
    sdf      8:80   0   1.8T  0 disk
    sdg      8:96   0   1.8T  0 disk
    sdh      8:112  0   1.8T  0 disk
  2. Add the SSDs that you added to the DGX Station to the RAID array.
    $ sudo mdadm --add raid-device-path ssd-device-paths
    raid-device-path
    The device path of the RAID array, for example, /dev/md0.
    ssd-device-paths
    A space-separated list of the device paths of the SSDs that you added, in which each path is of the form /dev/device-id.

    This example adds the SSDs with device paths /dev/sde, /dev/sdf, /dev/sdg, and /dev/sdh to the RAID array with device path /dev/md0.

    $ sudo mdadm --add /dev/md0 /dev/sde /dev/sdf /dev/sdg /dev/sdh
  3. Increase the number of devices in the RAID array to 7.
    $ sudo mdadm --grow --raid-devices=7 raid-device-path
    raid-device-path
    The device path of the RAID array, for example, /dev/md0.
    Note: Increasing the number of devices in the RAID array to 7 may require several hours or even longer. If the system crashes, is shut down, or is rebooted while the number of devices on the array is being increased, all data that was on the array is erased.

    This example increases the number of devices in the RAID array with device path /dev/md0 to 7.

    $ sudo mdadm --grow --raid-devices=7 /dev/md0
  4. Resize the file system that resides on the RAID array to use the additional physical space in the array.
    $ sudo resize2fs raid-device-path
    raid-device-path
    The device path of the RAID array, for example, /dev/md0.

    This example resizes the file system that resides on the RAID array with device path /dev/md0.

    $ sudo resize2fs /dev/md0
The RAID array is expanded with its existing RAID level. The data on the array is preserved, even if the array is a RAID 0 array.
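
The expansion procedure can be planned before anything is run. The following Python sketch picks out whole disks that have no partitions, no RAID volume, and no mount point in lsblk output, then builds the three commands from this procedure. It only prints the commands and does not run them. The function names `unused_disks` and `expansion_commands` are illustrative, and the sketch assumes the plain lsblk layout shown in step 1.

```python
SAMPLE_LSBLK = """\
NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda      8:0    0   1.8T  0 disk
|_sda1   8:1    0   487M  0 part  /boot/efi
|_sda2   8:2    0   1.8T  0 part  /
sdb      8:16   0   1.8T  0 disk
|_md0    9:0    0   5.2T  0 raid0 /raid
sdc      8:32   0   1.8T  0 disk
|_md0    9:0    0   5.2T  0 raid0 /raid
sdd      8:48   0   1.8T  0 disk
|_md0    9:0    0   5.2T  0 raid0 /raid
sde      8:64   0   1.8T  0 disk
sdf      8:80   0   1.8T  0 disk
sdg      8:96   0   1.8T  0 disk
sdh      8:112  0   1.8T  0 disk
"""


def unused_disks(lsblk_text: str) -> list:
    """Return the names of whole disks that have no child device and no mount point."""
    rows = [line.split() for line in lsblk_text.splitlines() if line.strip()]
    candidates = []
    for index, fields in enumerate(rows):
        # A six-field row whose TYPE is "disk" is a whole disk without a mount point.
        if len(fields) != 6 or fields[5] != "disk":
            continue
        next_fields = rows[index + 1] if index + 1 < len(rows) else None
        # A child row starts with a tree-drawing character, not a letter or digit.
        has_child = next_fields is not None and not next_fields[0][0].isalnum()
        if not has_child:
            candidates.append(fields[0])
    return candidates


def expansion_commands(raid_device: str, existing_count: int, new_disks: list) -> list:
    """Build the add, grow, and resize commands as argument lists without running them."""
    new_paths = [f"/dev/{name}" for name in new_disks]
    total = existing_count + len(new_paths)
    return [
        ["sudo", "mdadm", "--add", raid_device, *new_paths],
        ["sudo", "mdadm", "--grow", f"--raid-devices={total}", raid_device],
        ["sudo", "resize2fs", raid_device],
    ]


for command in expansion_commands("/dev/md0", 3, unused_disks(SAMPLE_LSBLK)):
    print(" ".join(command))
```

Review the printed commands against the lsblk output on your own system before running any of them. The sketch cannot tell a newly added SSD from any other unused disk, such as removable media.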

4.7.7. Configuring the SSDs for Data Storage as an NFS Cache

As supplied from the factory, the SSDs in the DGX Station for data storage are configured for local persistent storage. If your application data is stored in remote NFS-mounted file systems, you can improve application performance by configuring the data SSDs as an NFS cache and configuring the NFS-mounted file systems to use this cache.

If any data that you want to preserve is stored on the SSDs for data storage, move this data to another file system.
  1. Optional: If the SSDs in the DGX Station for data storage are configured in a RAID 5 array, change the RAID level of the array to RAID 0.
    $ sudo configure_raid_array.py -m raid0

    Because cache data is volatile, you can use the full capacity of the RAID array without redundancy and not risk loss of any persistent data.

  2. Download information from all configured sources about the latest versions of the packages.
    $ sudo apt update
  3. Install the cachefilesd and associated DGX configuration packages, which contain the cache daemon and its associated files, such as the startup and configuration files.
    $ sudo apt install cachefilesd dgx-conf-cachefilesd
  4. Enable the cache daemon to run.
    1. In a plain text editor such as vi, nano, or gedit, open the file /etc/default/cachefilesd with sudo user privileges.

      For example:

      $ sudo vi /etc/default/cachefilesd
    2. Uncomment the RUN=yes line.
    3. Save your changes and quit the editor.
  5. Configure the cache daemon by editing the cache daemon configuration file.
    1. In a plain text editor such as vi, nano, or gedit, open the file /etc/cachefilesd.conf with sudo user privileges.

      For example:

      $ sudo vi /etc/cachefilesd.conf
    2. Set the cache directory to /raid and the FS-Cache tag to dgxcache.
      dir /raid
      tag dgxcache
    3. Set the culling limits to values that are optimized for deep learning workloads and provide the fastest throughput for training from large datasets.
      brun  25%
      bcull 15%
      bstop  5%
      frun  10%
      fcull  7%
      fstop  3%
    4. Save your changes and quit the editor.
    For information about all the options that you can set in this file, see the cachefilesd.conf man page.

    This example shows a complete /etc/cachefilesd.conf file for configuring the cache daemon for the DGX Station. The LSM security context is the default security context of the cachefilesd daemon.

    ###############################################################################
    #
    # Copyright (C) 2006,2010 Red Hat, Inc. All Rights Reserved.
    # Written by David Howells (dhowells@redhat.com)
    #
    # This program is free software; you can redistribute it and/or
    # modify it under the terms of the GNU General Public License
    # as published by the Free Software Foundation; either version
    # 2 of the License, or (at your option) any later version.
    #
    ###############################################################################
    
    dir /raid
    tag dgxcache
    brun  25%
    bcull 15%
    bstop  5%
    frun  10%
    fcull  7%
    fstop  3%
    
    # Assuming you're using SELinux with the default security policy included in
    # this package
    secctx system_u:system_r:cachefiles_kernel_t:s0
  6. Start the cache daemon.
    $ sudo systemctl start cachefilesd
  7. Confirm that the cache daemon started properly.
    $ sudo systemctl status cachefilesd
    ● cachefilesd.service - LSB: CacheFiles daemon
       Loaded: loaded (/etc/init.d/cachefilesd; generated)
       Active: active (running) since Thu 2020-01-30 18:05:39 PST; 13s ago
         Docs: man:systemd-sysv-generator(8)
      Process: 3973 ExecStop=/etc/init.d/cachefilesd stop (code=exited, status=0/SUCCESS)
      Process: 4014 ExecStart=/etc/init.d/cachefilesd start (code=exited, status=0/SUCCESS)
        Tasks: 1 (limit: 4915)
       CGroup: /system.slice/cachefilesd.service
               └─4041 /sbin/cachefilesd
    
    Jan 30 18:05:39 mydgxstation systemd[1]: Starting LSB: CacheFiles daemon.
    Jan 30 18:05:39 mydgxstation cachefilesd[4014]:  * Starting FilesCache daemon  cachefilesd
    Jan 30 18:05:39 mydgxstation cachefilesd[4039]: About to bind cache
    Jan 30 18:05:39 mydgxstation cachefilesd[4039]: Bound cache
    Jan 30 18:05:39 mydgxstation cachefilesd[4041]: Daemon Started
    Jan 30 18:05:39 mydgxstation cachefilesd[4014]:    ...done.
    Jan 30 18:05:39 mydgxstation systemd[1]: Started LSB: CacheFiles daemon.
  8. Press Ctrl+C to return to the shell prompt.
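
Before starting the cache daemon, it can be useful to check that the culling limits set in the configuration step are mutually consistent. The following Python sketch is a minimal check, not part of the DGX Station software. It assumes the ordering rule described in the cachefilesd documentation, namely that 0 <= bstop < bcull < brun < 100 and 0 <= fstop < fcull < frun < 100. Verify that rule against the man page on your system. The function names are illustrative.

```python
SAMPLE_CONF = """\
dir /raid
tag dgxcache
brun  25%
bcull 15%
bstop  5%
frun  10%
fcull  7%
fstop  3%
"""

LIMIT_NAMES = {"brun", "bcull", "bstop", "frun", "fcull", "fstop"}


def parse_culling_limits(conf_text: str) -> dict:
    """Return the culling limits in the configuration text as integer percentages."""
    limits = {}
    for line in conf_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in LIMIT_NAMES:
            limits[fields[0]] = int(fields[1].rstrip("%"))
    return limits


def limits_are_consistent(limits: dict) -> bool:
    """Check the assumed ordering rule for the block and file limits."""
    if set(limits) != LIMIT_NAMES:
        return False  # At least one limit is missing from the file.
    blocks_ok = 0 <= limits["bstop"] < limits["bcull"] < limits["brun"] < 100
    files_ok = 0 <= limits["fstop"] < limits["fcull"] < limits["frun"] < 100
    return blocks_ok and files_ok


print(limits_are_consistent(parse_culling_limits(SAMPLE_CONF)))  # True
```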

After configuring the SSDs for data storage as an NFS cache, ensure that the mount option fsc is set for all NFS-mounted file systems that you want to use the cache.

This example shows an entry in /etc/fstab for mounting a file system for which the fsc option is set.

myfileserver.example.com:/mnt/shares/dldata /var/local/dldata nfs rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0
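
To find NFS entries that do not yet use the cache, you can scan /etc/fstab for NFS mounts that lack the fsc option. The following Python sketch shows one way to do this. The function name `nfs_mounts_without_fsc` is illustrative, and the sketch assumes the standard whitespace-separated fstab field order (device, mount point, type, options).

```python
SAMPLE_FSTAB = (
    "myfileserver.example.com:/mnt/shares/dldata /var/local/dldata nfs "
    "rw,noatime,rsize=32768,wsize=32768,nolock,tcp,intr,fsc,nofail 0 0\n"
)


def nfs_mounts_without_fsc(fstab_text: str) -> list:
    """Return the mount points of NFS entries whose options do not include fsc."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and comments.
        fields = line.split()
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue  # Not an NFS entry.
        if "fsc" not in fields[3].split(","):
            missing.append(fields[1])
    return missing


print(nfs_mounts_without_fsc(SAMPLE_FSTAB))  # []
```

An empty list means that every NFS entry in the text already sets the fsc option.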

4.7.8. Sanitizing the DGX Station Persistent Storage

Sanitizing the DGX Station persistent storage permanently destroys all the data that was stored there. After the data is destroyed, it cannot be recovered. Sanitizing the DGX Station persistent storage involves sanitizing all the SSDs for data storage and the SSD for the operating system.

Deleting all files in the DGX Station persistent storage or reformatting all the DGX Station SSDs does not sanitize the DGX Station persistent storage because data that was stored there can still be recovered by data-recovery tools.

Before sanitizing the DGX Station persistent storage, prepare the following bootable installation media:

4.7.8.1. Running an Ubuntu Desktop LiveCD Session on the DGX Station

To be able to sanitize the OS SSD, you must run an Ubuntu Desktop LiveCD session on the DGX Station from a bootable installation medium, such as a USB flash drive or DVD-ROM. You must use the Ubuntu Desktop OS instead of DGX OS Desktop because you must run the OS without installing it and DGX OS Desktop lacks this option.

To complete this task, you need a bootable installation medium, such as a USB flash drive or DVD-ROM, that contains the Ubuntu Desktop OS LiveCD image. For instructions, see Preparing your LiveCD on the Ubuntu Community Help Wiki.

  1. Shut down the DGX Station.
  2. Load the USB flash drive or DVD-ROM into the DGX Station.
    • If you are using a USB flash drive, plug it into one of the USB ports of the DGX Station.
    • If you are using a DVD-ROM, connect an external optical drive to the DGX Station and load the DVD-ROM into the drive.
  3. Power on the DGX Station.
  4. At the first NVIDIA screen to appear, press F8 to select the boot device.
  5. In the menu for selecting the boot device, use the arrow keys to select UEFI: usb-key-or-dvd-rom-name, Partition n (size) and press Enter.
  6. When the GNU GRUB menu appears, select Try Ubuntu Without Installing and press Enter. You are logged in to the Ubuntu desktop.
    Note: The standard Ubuntu OS graphics drivers are incompatible with the NVIDIA Tesla V100 GPU cards in the DGX Station. As a result, the Ubuntu desktop might be incorrectly sized and applications might not accept input from the keyboard and mouse. Overcome this compatibility issue by switching to a text-only TTY session.
  7. Press Ctrl+Alt+F2 to switch to a text-only TTY session.
  8. At the Ubuntu login prompt, log in as user ubuntu. No password is required.

4.7.8.2. Sanitizing All DGX Station SSDs

Sanitize the DGX Station persistent storage by sanitizing all the SSDs for data storage and the SSD for the operating system. All SSDs that NVIDIA supplies with the DGX Station support the ATA SANITIZE command. Therefore, you can use the Ubuntu OS command hdparm to sanitize the DGX Station SSDs.

Ensure that you are running an Ubuntu Desktop LiveCD session on the DGX Station as explained in Running an Ubuntu Desktop LiveCD Session on the DGX Station.

This task requires sudo privileges.

Perform this task in a text-only TTY session from the Ubuntu Desktop LiveCD session.
  1. Obtain the device IDs of the SSDs by searching for the word disk in the output from the lsblk command.
    $ lsblk | grep disk

    You can identify the SSDs from their size, which is much larger than the size of any removable media that might be connected to the DGX Station, such as the USB flash drive from which you are running the Ubuntu Desktop LiveCD session.

    In the following example, the device IDs of the SSDs are sda, sdb, sdc, and sdd. The device ID sde is the device ID of a USB flash drive.

    $ lsblk | grep disk
    sda      8:0    0   1.8T  0 disk
    sdb      8:16   0   1.8T  0 disk
    sdc      8:32   0   1.8T  0 disk
    sdd      8:48   0   1.8T  0 disk
    sde      8:64   1   1.9G  0 disk /cdrom
  2. Confirm that all the SSDs support the ATA SANITIZE command.

    For each SSD, run the hdparm command with the -I option.

    $ sudo hdparm -I /dev/device-id | grep SANIT
    device-id
    The device ID of the SSD, for example, sdc.

    This example confirms that SSD sdc supports the ATA SANITIZE command. The asterisk (*) in the output from the hdparm command denotes support.

    $ sudo hdparm -I /dev/sdc | grep SANIT
       * SANITIZE_ANTIFREEZE_LOCK_EXT command
       * SANITIZE feature set
  3. Issue the ATA SANITIZE command to all the SSDs.

    For each SSD, run the hdparm command with the --yes-i-know-what-i-am-doing and --sanitize-block-erase options.

    $ sudo hdparm \
    --yes-i-know-what-i-am-doing \
    --sanitize-block-erase /dev/device-id
    device-id
    The device ID of the SSD, for example, sdc.

    This example issues the ATA SANITIZE command to SSD sdc.

    $ sudo hdparm \
    --yes-i-know-what-i-am-doing \
    --sanitize-block-erase /dev/sdc
    
    /dev/sdc:
    Issuing SANITIZE_BLOCK_ERASE command
    Operation started in background
    You may use `--sanitize-status` to check progress
    Sanitizing a single SSD takes several minutes.
  4. Check the status of the sanitization.

    For each SSD, run the hdparm command with the --sanitize-status option repeatedly until sanitization of the SSD is completed without error.

    $ sudo hdparm --sanitize-status /dev/device-id
    device-id
    The device ID of the SSD, for example, sdc.

    This example shows that sanitization of SSD sdc is still in progress.

    $ sudo hdparm --sanitize-status /dev/sdc
    
    /dev/sdc:
    Issuing SANITIZE_STATUS command
    Sanitize status:
        State:    SD2 Sanitize operation In Process
        Progress: 0x72aa (44%)
    

    This example confirms that sanitization of SSD sdc was completed without error.

    $ sudo hdparm --sanitize-status /dev/sdc
    
    /dev/sdc:
    Issuing SANITIZE_STATUS command
    Sanitize status:
        State:    SD0 Sanitize Idle
        Last Sanitize Operation Completed Without Error
  5. Shut down the DGX Station.
    $ sudo shutdown -P now
  6. When prompted by the Ubuntu Desktop OS, remove the installation medium and press Enter.
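The per-SSD checks in steps 1, 2, and 4 above can be sketched as small shell helpers that read command output on standard input. This is a minimal sketch, not part of the DGX OS Desktop software: the function names are hypothetical, and selecting SSDs by a removable flag (RM) of 0 is an assumption drawn from the example lsblk output, so confirm the device IDs by size as described in step 1 before sanitizing anything.

```shell
# List candidate SSDs: whole disks that are not flagged as removable.
# Reads "NAME RM TYPE" lines on stdin, as produced by:
#   lsblk -dn -o NAME,RM,TYPE
internal_disks() {
    awk '$2 == 0 && $3 == "disk" { print $1 }'
}

# Succeed if "hdparm -I" output on stdin marks the SANITIZE feature set
# as supported, that is, the line is prefixed with an asterisk.
supports_sanitize() {
    grep -q '^[[:space:]]*\*[[:space:]]*SANITIZE feature set'
}

# Succeed if "hdparm --sanitize-status" output on stdin shows an idle
# drive whose last sanitize operation completed without error.
sanitize_completed() {
    status=$(cat)
    printf '%s\n' "$status" | grep -q 'SD0 Sanitize Idle' &&
        printf '%s\n' "$status" | grep -q 'Completed Without Error'
}

# Example use in the LiveCD session (commented out so that sourcing
# this file never touches a drive):
#   for id in $(lsblk -dn -o NAME,RM,TYPE | internal_disks); do
#       sudo hdparm -I "/dev/$id" | supports_sanitize ||
#           echo "$id: SANITIZE not supported"
#   done
#   until sudo hdparm --sanitize-status /dev/sdc | sanitize_completed; do
#       sleep 30
#   done
```

The helpers only inspect text, so they can be tried against saved output before being pointed at real devices.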

After sanitizing all the DGX Station SSDs, return the DGX Station to service by installing the DGX Station software and re-initializing the RAID array.

For instructions, see Installing the DGX Station Software Image from a USB Flash Drive or DVD-ROM. When you are prompted for the option for installing the DGX Station software, select Install DGX OS Desktop release and re-initialize RAID0 volume.

After installing the DGX Station software and re-initializing the RAID array, you can change the RAID level of the RAID array to RAID 5 if necessary. For instructions, see Changing the RAID Level of the RAID Array.

4.8. Restoring the DGX Station Software Image

If the DGX Station software image becomes corrupted or the OS SSD was replaced after a failure, restore the DGX Station software image to its original factory condition from a pristine copy of the image.

A USB flash drive is supplied from which you can restore the DGX Station software image. Before using this USB drive to restore the DGX Station software image, contact NVIDIA Enterprise Support to see if a later version of the software image is available. If a later version of the image is available, prepare a bootable installation medium that contains the current software image as explained in the following topics:

  • Obtaining the DGX Station Software ISO Image and Checksum File
  • Creating a Bootable Installation Medium
  • Verifying the Bootable Installation Medium

When you have a bootable installation medium that contains the current software image, install the image as explained in Installing the DGX Station Software Image from a USB Flash Drive or DVD-ROM.

Note: Updates to the DGX Station software might have been made available after the latest available ISO image file was created. To ensure that you have the latest DGX Station software, including security updates, check for updates and install any available updates after you restore the software image. For more information, see Upgrading Within the Same DGX OS Desktop Major Release.

4.8.1. Obtaining the DGX Station Software ISO Image and Checksum File

To ensure that you restore the latest available version of the DGX Station software image, obtain the current ISO image file from NVIDIA Enterprise Support. A checksum file is provided for the image to enable you to verify the bootable installation medium that you create from the image file.
  1. Log on to the NVIDIA Enterprise Support site.
  2. Click the Announcements tab to locate the download links for the DGX Station software image.
  3. Download the ISO image and its checksum file and save them to your local disk. The ISO image is also available in an archive file. If you download the archive file, be sure to extract the ISO image before proceeding.

4.8.2. Creating a Bootable Installation Medium

After obtaining an ISO file that contains the DGX OS Desktop software image from NVIDIA Enterprise Support, create a bootable installation medium, such as a USB flash drive or DVD-ROM, that contains the image.

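  • If you are creating a bootable USB flash drive, follow the instructions for the platform that you are using:
    • On an Ubuntu Desktop system, see Creating a Bootable USB Flash Drive by Using Startup Disk Creator.
    • On a Windows system, see Creating a Bootable USB Flash Drive by Using Akeo Rufus.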
  • If you are creating a bootable DVD-ROM, you can use any of the methods described in Burning the ISO on to a DVD on the Ubuntu Community Help Wiki.
    Note: The ISO file that contains the software image for some DGX OS Desktop releases is greater than the 4.7 GB capacity of a single-layer DVD-ROM. You cannot install these releases from a bootable DVD-ROM because installation of DGX OS Desktop from a dual-layer DVD-ROM is not supported. Check the size of the ISO file that contains the DGX OS Desktop software image before creating a bootable DVD-ROM.
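The size check in the note above can be sketched as a shell function. The function name is hypothetical, and the default limit of 4700000000 bytes is an assumed reading of the 4.7 GB figure; the usable capacity of a particular disc may differ slightly, so treat a result close to the limit with caution.

```shell
# Succeed if the ISO file is no larger than a size limit in bytes.
# The default limit (4700000000) is an assumption based on the 4.7 GB
# single-layer capacity; pass a second argument to override it.
fits_single_layer_dvd() {
    iso=$1
    limit=${2:-4700000000}
    [ -f "$iso" ] || return 2
    # wc -c prints the byte count; strip any padding some systems add.
    size=$(wc -c < "$iso" | tr -d ' ')
    [ "$size" -le "$limit" ]
}

# Example use (commented out; the file name is the one used in the
# verification examples in this guide):
#   fits_single_layer_dvd DGXStation-3.1.2_56d4a9.iso ||
#       echo "Image is too large for a single-layer DVD-ROM"
```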

4.8.2.1. Creating a Bootable USB Flash Drive by Using Startup Disk Creator

On an Ubuntu Desktop system, you can use Startup Disk Creator to create a bootable USB flash drive that contains the DGX Station software image.

Ensure that the following prerequisites are met:

  1. Plug the USB flash drive into one of the USB ports of your Ubuntu Desktop system.
  2. Search for Startup Disk Creator.
    • Ubuntu 18.04 Desktop: Open Activities overview, and in the search box, type Startup Disk Creator.
    • Ubuntu 16.04 Desktop: Open the Dash, and in the search box, type Startup Disk Creator.
  3. Click the Startup Disk Creator icon.
  4. In the Make Startup Disk window that opens, from the Source disc image (.iso) list, select the DGX Station software image file.

    Screen capture of the Startup Disk Creator window showing a DGX Station software image and a USB flash drive selected.

    If the DGX Station software image file is not listed, click Other and in the window that opens, navigate to the file, select the file, and click Open.

  5. From the Disk to use list, select the USB flash drive and click Make Startup Disk.

4.8.2.2. Creating a Bootable USB Flash Drive by Using Akeo Rufus

On a Windows system, you can use the Akeo Reliable USB Formatting Utility (Rufus) to create a bootable USB flash drive that contains the DGX OS software image.

Ensure that the following prerequisites are met:

  1. Plug the USB flash drive into one of the USB ports of your Windows system.
  2. Download and launch the Akeo Reliable USB Formatting Utility (Rufus).



  3. In Drive Properties, select the following options:
    1. In Boot selection, click SELECT, locate, and select the DGX OS software image.
    2. In Partition scheme, select GPT.
    3. In Target System, select UEFI (non CSM).
  4. In Format Options, select the following options:
    1. In File system, select NTFS.
    2. In Cluster Size, select 4096 bytes (Default).
  5. Click Start. Because the image is a hybrid ISO file, you are prompted to select whether to write the image in ISO Image (file copy) mode or DD Image (disk image) mode.



  6. Select Write in ISO Image mode and click OK.

4.8.3. Verifying the Bootable Installation Medium

On a Linux system, you can use the checksum file provided for the DGX Station software image to verify the installation medium that you created from the image.

Ensure that the following prerequisites are met:

How to verify a bootable installation medium depends on whether it is a USB flash drive or a DVD-ROM.

4.8.3.1. Verifying a Bootable USB Flash Drive

  1. Plug the USB flash drive into one of the USB ports of your Linux system.
  2. Obtain the device ID of the USB flash drive by running the lsblk command.
    $ lsblk

    You can identify the USB flash drive from its size, which is much smaller than the size of the SSDs in the DGX Station, and from the mount points of any partitions on the drive, which are under /media.

    In the following example, the device ID of the USB flash drive is sde1.

    $ lsblk
    NAME   MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
    sda      8:0    0  1.8T  0 disk
    |_sda1   8:1    0  487M  0 part  /boot/efi
    |_sda2   8:2    0  1.8T  0 part  /
    sdb      8:16   0  1.8T  0 disk
    |_md0    9:0    0  5.2T  0 raid0 /raid
    sdc      8:32   0  1.8T  0 disk
    |_md0    9:0    0  5.2T  0 raid0 /raid
    sdd      8:48   0  1.8T  0 disk
    |_md0    9:0    0  5.2T  0 raid0 /raid
    sde      8:64   1  3.7G  0 disk
    |_sde1   8:65   1  3.2G  0 part  /media/deepl/DGXSTATION
    |_sde2   8:66   1  2.3M  0 part
    $
  3. Compute the checksum of the image on the USB flash drive.
    $ sudo dd if=device-id bs=block-size | cksum
    device-id
    The device ID of the USB flash drive, for example, /dev/sde1.
    block-size
    The block size to be used by the dd command, for example, 1M.

    This example computes the checksum of an image on the USB flash drive with device ID /dev/sde1 using a block size of 1 MB.

    $ sudo dd if=/dev/sde1 bs=1M | cksum
    3299+1 records in
    3299+1 records out
    3459317760 bytes (3.5 GB, 3.2 GiB) copied, 164.369 s, 21.0 MB/s
    3992706625 3459317760
  4. Obtain the checksum value from the checksum file.
    $ cat checksum-file
    checksum-file
    The path, including the file name, to the checksum file.

    This example obtains the checksum value for the image DGXStation-3.1.2_56d4a9.iso from the checksum file DGXStation-3.1.2_56d4a9.crc in the current working directory.

    $ cat DGXStation-3.1.2_56d4a9.crc
    3992706625 3459317760 DGXStation-3.1.2_56d4a9.iso
    If the value obtained from the checksum file matches the value computed from the image, the integrity of the installation medium has been successfully verified.
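The comparison in steps 3 and 4 can also be scripted rather than done by eye. The following is a minimal sketch with a hypothetical function name; it assumes that the checksum file holds the CRC and byte count in its first two fields, as in the example above.

```shell
# Compare the CRC and byte count printed by cksum with the values
# recorded on the first line of a checksum file.
# Usage: checksum_matches "<crc> <size>" checksum-file
checksum_matches() {
    computed=$1
    crc_file=$2
    # Keep only the first two fields (CRC and size) from each side.
    want=$(awk 'NR == 1 { print $1, $2 }' "$crc_file")
    got=$(printf '%s\n' "$computed" | awk 'NR == 1 { print $1, $2 }')
    [ -n "$want" ] && [ "$got" = "$want" ]
}

# Example use (commented out; device ID and file name are taken from
# the example above):
#   checksum_matches "$(sudo dd if=/dev/sde1 bs=1M | cksum)" \
#       DGXStation-3.1.2_56d4a9.crc && echo "Installation medium verified"
```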

4.8.3.2. Verifying a Bootable DVD-ROM

  1. Load the DVD-ROM into an optical drive connected to your Linux system.
  2. Compute the checksum of the image on the DVD-ROM.
    $ cksum < /dev/sr0
    This example computes the checksum of an image on a DVD-ROM.
    $ cksum < /dev/sr0
    3992706625 3459317760
  3. Obtain the checksum value from the checksum file.
    $ cat checksum-file
    checksum-file
    The path, including the file name, to the checksum file.

    This example obtains the checksum value for the image DGXStation-3.1.2_56d4a9.iso from the checksum file DGXStation-3.1.2_56d4a9.crc in the current working directory.

    $ cat DGXStation-3.1.2_56d4a9.crc
    3992706625 3459317760 DGXStation-3.1.2_56d4a9.iso
    If the value obtained from the checksum file matches the value computed from the image, the integrity of the installation medium has been successfully verified.

4.8.4. Installing the DGX Station Software Image from a USB Flash Drive or DVD-ROM

Before installing the DGX Station software image, ensure that you have a bootable USB flash drive or DVD-ROM that contains the current DGX Station software image.

CAUTION:
Installing the DGX Station software image erases all data stored on the OS SSD. The /home partition, where all users' documents, software settings, bookmarks, and other personal files are stored, resides on the OS SSD and will be erased. However, if you choose to install the DGX Station software and preserve the RAID array contents, persistent data stored in the RAID array is unaffected.
  1. Shut down the DGX Station.
  2. Load the USB flash drive or DVD-ROM into the DGX Station.
    • If you are using a USB flash drive, plug it into one of the USB ports of the DGX Station.
    • If you are using a DVD-ROM, connect an external optical drive to the DGX Station and load the DVD-ROM into the drive.
  3. Power on the DGX Station.
  4. At the first NVIDIA screen to appear, press F8 to select the boot device.
  5. In the menu for selecting the boot device, use the arrow keys to select UEFI: usb-key-or-dvd-rom-name, Partition n (size) and press Enter.
  6. Boot the DGX Station install media.
  7. To complete the standard installation without root filesystem encryption, select one of the following options:
    • Install DGX OS 5.0.0: Installs DGX OS 5.0.0 and reformats the data RAID.
    • Install DGX OS 5.0.0: Without Reformatting Data RAID: Installs DGX OS 5.0.0 without reformatting the data RAID.
  8. To set up filesystem encryption, select Advanced Installation Options, and then select one of the following options:
    • Install DGX Base OS 5.0.0: Installs DGX Base OS 5.0.0 and reformats the data RAID.
    • Install DGX Base OS 5.0.0: Without Reformatting Data RAID: Installs DGX Base OS 5.0.0 without reformatting the data RAID.
    Note: For both of these options, the default root filesystem passphrase is nvidia3d.
    Here is some additional information:
    • The GRUB menu options can appear during installation or when DGX Station is reimaged at the customer site.
    • If the boot is not encrypted at the factory, you need to reimage DGX Station and then encrypt the boot.
  9. When the installation is complete, respond to the prompts to accept end user license agreements for NVIDIA software and to configure the Ubuntu OS, including creating your user name and password for logging in to the DGX Station.
  10. After the Ubuntu OS configuration is complete, log in to the DGX Station to access your Ubuntu desktop.
  11. Eject the USB flash drive or DVD-ROM.
  12. Unplug the USB flash drive or optical drive from the DGX Station.

4.9. Updating the DGX Station System BIOS

If you need to update the DGX Station system BIOS, you can obtain the current version of it from NVIDIA Support Enterprise Services.

CAUTION:

Update the system BIOS only if required to resolve an issue with the DGX Station or if you are directed to by NVIDIA Support Enterprise Services to address a specific issue. If your DGX Station is operating normally and you have not been directed by NVIDIA Support Enterprise Services, do not update the system BIOS. An error during an attempt to update the system BIOS may leave your DGX Station unable to boot.

If you must update the system BIOS, be sure to obtain the BIOS file from NVIDIA Support Enterprise Services. Do not obtain a BIOS file from the motherboard manufacturer or any other source.

To complete this task, you need a USB flash drive formatted to a single FAT16 or FAT32 partition.
  1. Obtain the system BIOS file.
    1. Log on to NVIDIA Enterprise Support.
    2. Click the Announcements tab to locate the download links for the archive file containing the DGX Station system BIOS file.
    3. Download the archive file and extract the system BIOS file.
  2. Copy the system BIOS file to the USB flash drive.
  3. Shut down the DGX Station.
  4. Plug the USB flash drive into one of the USB ports of the DGX Station.
  5. Power on the DGX Station.
  6. At the first NVIDIA screen to appear, press Delete or F2 to enter the UEFI BIOS setup.
  7. In the UEFI BIOS Utility - EZ Mode screen, click Advanced Mode.
  8. From the Tool menu, choose EZ 3 Flash Utility and press Enter.
  9. In the EZ 3 Flash Update screen, select via Storage Device(s) as the BIOS update method and press Enter.
  10. In the Drive list, use the up arrow and down arrow keys to select the USB flash drive that contains the BIOS file and press Enter.
  11. In the Folder list, use the up arrow and down arrow keys to select the BIOS file.
  12. Press Enter to start the BIOS update process.
    CAUTION:
    To avoid the risk of leaving your DGX Station unable to boot, do not shut down or reset the DGX Station during the BIOS update process.
  13. When the BIOS update process is complete, reboot the DGX Station.
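Before starting the procedure above, the format of the USB flash drive can be checked on a Linux system. This is a minimal sketch with a hypothetical function name; it assumes that lsblk reports FAT16 and FAT32 file systems with the type vfat, and it does not accept a drive that is formatted without a partition table.

```shell
# Succeed if the drive has exactly one partition and that partition
# is FAT-formatted (reported by lsblk as "vfat").
# Reads "TYPE FSTYPE" lines on stdin, as produced by:
#   lsblk -n -o TYPE,FSTYPE /dev/device-id
single_fat_partition() {
    awk '$1 == "part" { parts++; if ($2 != "vfat") other = 1 }
         END { exit !(parts == 1 && !other) }'
}

# Example use (commented out; sde is a placeholder device ID):
#   lsblk -n -o TYPE,FSTYPE /dev/sde | single_fat_partition ||
#       echo "Reformat the drive to a single FAT partition"
```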

4.10. Maintaining the GPU Liquid Cooling System

A liquid cooling system keeps the GPUs in the DGX Station within their required operating temperature range. To ensure reliable operation of the cooling system, you must maintain it periodically.

4.10.1. Monitoring GPU Temperatures

  1. Search for NVIDIA X Server Settings.
    • DGX OS Desktop 4 releases: Open Activities overview, and in the search box, type NVIDIA X Server Settings.
    • DGX OS Desktop 3 releases: Open the Dash, and in the search box, type NVIDIA X Server Settings.
  2. Click the NVIDIA X Server Settings icon.
  3. Under each GPU in the list of GPUs in the NVIDIA X Server Settings window, click Thermal Settings.

    Thermal sensor information for the GPU is displayed, including its current temperature and an indication of whether the temperature is within the GPU's operating range.



    Screen capture of the NVIDIA X Server Settings window showing GPU thermal settings.

If the GPUs are running too hot, check the level of the liquid in the GPU cooling system as explained in Checking the Level of the Liquid in the GPU Cooling System.
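GPU temperatures can also be read from the command line with the nvidia-smi utility that is installed with the NVIDIA driver. The following is a minimal sketch: the function name is hypothetical, and the limit passed to it is a value that you choose, because this guide does not state a specific temperature threshold.

```shell
# Print the index of every GPU whose temperature is at or above a limit.
# Reads "index, temperature" CSV lines on stdin, as produced by:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits
hot_gpus() {
    limit=$1
    awk -F', *' -v limit="$limit" '$2 + 0 >= limit { print $1 }'
}

# Example use (commented out; 85 is an arbitrary illustrative limit in
# degrees Celsius, not an NVIDIA specification):
#   nvidia-smi --query-gpu=index,temperature.gpu \
#       --format=csv,noheader,nounits | hot_gpus 85
```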

4.10.2. Checking the Level of the Liquid in the GPU Cooling System

In normal operation, some coolant liquid may be lost from the system. Every 12 months, check the level of the liquid in the cooling system to ensure that it remains at the required level.
  1. Remove the side panel on the right of the DGX Station when viewed from the rear.
    1. Push the button on the right side of the DGX Station back panel to release the panel.

      Line drawing showing the button on the right side of the DGX Station back panel being pushed.

    2. Lift the panel to remove it.

      Line drawing showing the DGX Station side-panel being removed.

      CAUTION:
      To prevent damage from electrostatic discharge, avoid touching any of the components inside the DGX Station other than any components that you are replacing or servicing.
  2. Look at the gauge on the side of the cooling system pump to determine the level of the liquid in the cooling system.

    Line drawing of the gauge for the liquid in the GPU cooling system showing the Minimum Level indicator.

    • If the level of the liquid in the cooling system is at or above the Minimum Level in the reservoir, go to the next step.
    • If the liquid has fallen below the Minimum Level in the reservoir, replenish it as explained in Replenishing the Liquid in the GPU Cooling System.
  3. Replace the side panel of the DGX Station.
    1. Align the bottom edge of the side panel with the bottom edge of the DGX Station.

      Line drawing showing the side-panel being aligned with the bottom edge of the DGX Station.

    2. Firmly push the panel back into place to re-engage the latch.

      Line drawing showing the DGX Station side-panel latch being re-engaged.

4.10.3. Replenishing the Liquid in the GPU Cooling System

Replenish the liquid in the GPU cooling system if the liquid is below the required level or to refill the cooling system after draining it to renew the cooling liquid.

Contact NVIDIA Enterprise Support to obtain a DGX Station coolant kit, which contains:

  • 6 mm Allen wrench

  • 1 bottle of EK-CryoFuel Clear Premix coolant

    CAUTION:
    Use only the coolant that is supplied with the kit. Do not use any other type of coolant. Use of other types of coolant will void the DGX Station hardware warranty and may cause damage to or impair the performance of the system.
  • Flexible plastic filling bottle with delivery tube

  1. Ensure that the DGX Station is powered off.
  2. Fill the plastic filling bottle with the coolant from the kit.
  3. Use the Torx T20 Allen wrench to loosen the filler cap at the top of the cooling system pump, and when the cap is loose, remove it.

    Line drawing showing the filler cap of the GPU cooling system being removed.

  4. Insert the delivery tube of the filling bottle into the open filler cap at the top of the pump.
  5. Gently squeeze the filler bottle to dispense the coolant liquid into the pump until the liquid reaches the Maximum Level in the reservoir.

    Line drawing showing the coolant liquid being added to the GPU cooling system from a filler bottle.

  6. Replace the filler cap at the top of the pump and use the Torx T20 Allen wrench to tighten the cap until it is finger tight. Do not overtighten the filler cap.
  7. Power on the DGX Station and let it run for one minute. If the pump makes a grinding noise, power off and power on the DGX Station four times.
  8. Ensure that the level of the liquid in the cooling system is at the Maximum Level in the reservoir.

    Line drawing of the gauge for the liquid in the GPU cooling system showing the Maximum Level indicator.

    If the liquid has fallen below the Maximum Level in the reservoir, repeat the following sequence of steps until the level of the liquid in the cooling system remains at the Maximum Level.

    1. Remove the filler cap at the top of the cooling system pump.
    2. Dispense more coolant liquid into the pump until the liquid reaches the Maximum Level in the reservoir again.
    3. Replace the filler cap at the top of the pump.
    4. Power on the DGX Station and let it run for one minute.
    5. Check the level of the liquid in the cooling system.
  9. Power off the DGX Station.
  10. Replace the side panel of the DGX Station.
    1. Align the bottom edge of the side panel with the bottom edge of the DGX Station.

      Line drawing showing the side-panel being aligned with the bottom edge of the DGX Station.

    2. Firmly push the panel back into place to re-engage the latch.

      Line drawing showing the DGX Station side-panel latch being re-engaged.

A. Safety

To reduce the risk of bodily injury, electrical shock, fire, and equipment damage, read this document and observe all warnings and precautions in this guide before installing or maintaining your product. NVIDIA products are designed to operate safely when installed and used according to the product instructions and general safety practices. The guidelines included in this document explain the potential risks associated with computer operation and provide important safety practices designed to minimize these risks.

The product is designed and tested to meet IEC 60950-1, the Standard for the Safety of Information Technology Equipment. This also covers the national implementation of IEC 60950-1 based safety standards around the world, for example, UL 60950-1. These standards reduce the risk of injury from the following hazards:

  • Electric shock: Hazardous voltage levels contained in parts of the product
  • Fire: Overload, temperature, material flammability
  • Mechanical: Sharp edges, moving parts, instability
  • Energy: Circuits with high energy levels (240 volt amperes) or potential as burn hazards
  • Heat: Accessible parts of the product at high temperatures
  • Chemical: Chemical fumes and vapors
  • Radiation: Noise, ionizing, laser, ultrasonic waves

Retain and follow all product safety and operating instructions. Always refer to the documentation supplied with your equipment. Observe all warnings on the product and in the operating instructions.



Warning symbol

WARNING: FAILURE TO FOLLOW THESE SAFETY INSTRUCTIONS COULD RESULT IN FIRE, ELECTRIC SHOCK OR OTHER INJURY OR DAMAGE. ELECTRICAL EQUIPMENT CAN BE HAZARDOUS IF MISUSED. OPERATION OF THIS PRODUCT, OR SIMILAR PRODUCTS, MUST ALWAYS BE SUPERVISED BY AN ADULT. DO NOT ALLOW CHILDREN ACCESS TO THE INTERIOR OF ANY ELECTRICAL PRODUCT AND DO NOT PERMIT THEM TO HANDLE ANY CABLES.

A.1. Intended Application Uses

This product was evaluated as Information Technology Equipment (ITE), which may be installed in offices, schools, computer rooms, and similar commercial type locations. The suitability of this product for other product categories and environments (such as medical, industrial, residential, alarm systems, and test equipment), other than an ITE application, may require further evaluation.

A.2. General Precautions

To reduce the risk of personal injury or damage to the equipment:

  • Shut down the product and disconnect all AC power cables before installation.
  • Do not connect or disconnect any cables when performing installation, maintenance, or reconfiguration of this product during an electrical storm.
  • Never turn on any equipment when there is evidence of fire, water, or structural damage.
  • Place the product away from radiators, heat registers, stoves, amplifiers, or other products that produce heat.
  • Never use the product in a wet location.
  • Avoid inserting foreign objects through openings in the product.
  • Do not use conductive tools that could bridge live parts.
  • Do not make mechanical or electrical modifications to the equipment.
  • Use the product only with approved equipment.
  • Follow all cautions and instructions marked on the equipment. Do not attempt to defeat safety interlocks (where provided).
  • Operate the DGX Station in a place where the temperature is always in the range 10°C to 30°C (50°F to 86°F).

A.3. Electrical Precautions

Power Cable

To reduce the risk of electric shock, fire, or damage to the equipment:

  • Use only the supplied power cable and do not use this power cable with any other products or for any other purpose. Not all power cables have the same current ratings.
  • Do not use household extension cables with your product. Household extension cables do not have overload protection and are not intended for use with computer systems.
  • If you lose or damage the supplied power cable, or have to change the power cable for any reason, use a cable rated for your product and for the voltage and current marked on the electrical ratings label of the product. The voltage and current rating of the cable must be greater than the voltage and current rating marked on the product.
  • Plug the power cable into a grounded (earthed) electrical outlet that is easily accessible at all times. The product is equipped with a three-wire electrical grounding-type plug which has a third pin for ground. This plug fits only into a grounded electrical power outlet.
  • Do not disable the power cable grounding plug. The grounding plug is an important safety feature.
  • Do not place objects on power cables. Arrange them so that no one may accidentally step on or trip over them.
  • Do not pull on a cable. When unplugging the product from the electrical outlet, grasp the plug.
  • When possible, use one hand only to connect or disconnect cables.
  • Do not modify power cables or plugs. Consult a licensed electrician or your power company for site modifications.

Power Supply

  • Ensure that the voltage and frequency of your power source match the voltage and frequency inscribed on the equipment’s electrical rating label. If you have a question about the type of power source to use, contact your authorized service provider.
  • Connect the equipment to a properly wired and grounded electrical outlet and always follow your local or national wiring rules.
  • Ensure that the socket outlet is near the equipment and is readily accessible for disconnection.
  • To help protect your system from sudden, transient increases and decreases in electrical power, consider using a surge suppressor or line conditioner.
  • Never force a connector into a port. Check for obstructions on the port. If the connector and port don’t join with reasonable ease, they probably don’t match. Make sure that the connector matches the port and that you have positioned the connector correctly in relation to the port.
  • Do not open the power supply. Hazardous voltage, current and energy levels are present inside the power supply. The power supply in this product contains no user-serviceable parts. Return to manufacturer for servicing.

A.4. Communications Cable Precautions

To reduce the risk of exposure to electrical shock hazards from communications cables:

  • Do not connect communications cables during an electrical storm. There may be a risk of electric shock from lightning.
  • Do not connect or use communications cables in a wet location.
  • Disconnect the communications cables before opening a product enclosure, or touching or installing internal components.

A.5. Other Hazards

Proposition 65 Warning

This product contains chemicals known to the State of California to cause cancer and birth defects or other reproductive harm.

California Department of Toxic Substances Control

Perchlorate Material – special handling may apply. See www.dtsc.ca.gov/hazardouswaste/perchlorate.

Perchlorate Material: Lithium battery (CR2032) contains perchlorate. Please follow instructions for disposal.

Nickel

Nickel safety warning symbol

The decorative metal foam on the DGX Station casework contains some nickel. The metal foam is not intended for direct and prolonged skin contact. While nickel exposure is unlikely to be a problem, you should be aware of the possibility in case you’re susceptible to nickel-related reactions.

B. Connections, Controls, and Indicators

B.1. Front-Panel Connections and Controls

ID Type Qty Description
1 Power Button 1 Press to turn the DGX Station on or off
Line drawing showing the front-panel connections and controls for DGX Station.

B.2. Rear-Panel Connections and Controls

Current Units

ID Type Qty Description
1 USB 3.1 Type-C 1 USB 3.1 Type-C port
2 Ethernet 2 10G LAN ports (see LAN Port Indicators):
  • Lower port: LAN 1
  • Upper port: LAN 2
3 USB 3.0 4 USB 3.0 ports
4 S/PDIF Audio Output 1 Optical S/PDIF out port
5 eSATA 2 eSATA ports for connecting external storage devices, such as hard drives or optical drives, with an external power supply
6 AC Input 1 Power supply input
7 Reset Button 1 Press to reboot the system without turning off the system power
8 USB 3.1 Type-A 1 USB 3.1 Type-A port
9 Audio I/O 5 3.5 mm I/O ports for 2-, 4-, 6-, or 8-channel audio (see Audio I/O Connections)
10 DisplayPort 3 Ports for connecting up to 3 displays
11 Power Supply Switch 1 Turn the power supply on and off
Line drawing showing the rear-panel connections and controls for DGX Station.

Earlier Units

ID Type Qty Description
1 USB 3.1 Type-C 1 USB 3.1 Type-C port
2 Ethernet 2 10G LAN ports (see LAN Port Indicators):
  • Lower port: LAN 1
  • Upper port: LAN 2
3 USB 3.0 4 USB 3.0 ports
4 S/PDIF Audio Output 1 Optical S/PDIF out port
5 eSATA 2 eSATA ports for connecting external storage devices, such as hard drives or optical drives, with an external power supply
6 Power Supply Switch 1 Turn the power supply on and off
7 Reset Button 1 Press to reboot the system without turning off the system power
8 USB 3.1 Type-A 1 USB 3.1 Type-A port
9 Audio I/O 5 3.5 mm I/O ports for 2-, 4-, 6-, or 8-channel audio (see Audio I/O Connections)
10 DisplayPort 3 Ports for connecting up to 3 displays
11 AC Input 1 Power supply input
Line drawing showing the rear-panel connections and controls for earlier DGX Station units.

B.3. LAN Port Indicators

LEDs on each Ethernet LAN port indicate the connection status as illustrated in the following figure and described in the following tables.



Line drawing showing a LAN port and its activity and speed LED indicators.


Speed LED

Status Description
Off 100 Mbps connection
Orange 1 Gbps connection
Green 10 Gbps connection

Activity/Link LED

Status Description
Off No link
Green Linked
Green (blinking) Data activity

B.4. Audio I/O Connections

ID  Port Color   2-Channel  4-Channel      6-Channel         8-Channel
1   Pink         Mic In     Mic In         Mic In            Mic In
2   Black        N/A        Rear Speaker   Rear Speaker      Rear Speaker
3   Orange       N/A        N/A            Center/Subwoofer  Center/Subwoofer
4   Light Blue   Line In    Line In        Line In           Side Speaker
5   Lime Green   Line Out   Front Speaker  Front Speaker     Front Speaker
Line drawing showing the details of the rear-panel audio I/O connections.

C. Compliance

The NVIDIA DGX Station is compliant with the regulations listed in this section.

C.4. Brazil

INMETRO

C.5. Canada

Innovation, Science and Economic Development Canada (ISED)

CAN ICES-3(A)/NMB-3(A)

The Class A digital apparatus meets all requirements of the Canadian Interference-Causing Equipment Regulation.

Cet appareil numérique de la classe A respecte toutes les exigences du Règlement sur le matériel brouilleur du Canada.

C.6. China

RoHS Material Content

C.7. European Union

European Conformity; Conformité Européenne (CE)

This is a Class A product. In a domestic environment this product may cause radio frequency interference in which case the user may be required to take adequate measures.

The product has been marked with the CE Mark to illustrate its compliance.

This device complies with the following Directives:

  • EMC Directive (2014/30/EU) for Class A, I.T.E equipment.
  • Low Voltage Directive (2014/35/EU) for electrical safety.
  • RoHS Directive (2011/65/EU) for hazardous substances.
  • ErP Directive (2009/125/EC) for European Ecodesign.

A copy of the Declaration of Conformity to the essential requirements may be obtained directly from NVIDIA GmbH (Floessergasse 2, 81369 Munich, Germany).

C.8. India

BIS

Self Declaration - Conforming to IS13252:2010, R-41078743

Russia

CU-TR

C.12. South Africa

LOA

Compliant with SANS IEC 60950

SABS

Compliant with SANS 222 CISPR 22

C.15. United States

Federal Communications Commission (FCC)

FCC Marking (Class A)

This device complies with part 15 of the FCC Rules. Operation is subject to the following two conditions: (1) this device may not cause harmful interference, and (2) this device must accept any interference received, including any interference that may cause undesired operation of the device.

NOTE: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordance with the instruction manual, may cause harmful interference to radio communications. Operation of this equipment in a residential area is likely to cause harmful interference in which case the user will be required to correct the interference at his own expense.

United States/Canada

cULus Listing Mark

D. DGX Station Hardware Specifications

D.1. Environmental Conditions

Condition Operating Range Nonoperating Range
Ambient temperature 10°C to 30°C (50°F to 86°F) 5°C to 40°C (41°F to 104°F)
Relative humidity 10% to 80% (non-condensing) 8% to 90% (non-condensing)
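
The ranges above can be turned into a simple check against readings from a room sensor. This is an illustrative Python sketch only. The dictionary and function names are hypothetical, and the readings are assumed to come from your own monitoring equipment.

```python
# Ranges copied from the Environmental Conditions table: (minimum, maximum).
OPERATING = {"temp_c": (10.0, 30.0), "humidity_pct": (10.0, 80.0)}
NONOPERATING = {"temp_c": (5.0, 40.0), "humidity_pct": (8.0, 90.0)}


def c_to_f(temp_c: float) -> float:
    """Convert degrees Celsius to degrees Fahrenheit."""
    return temp_c * 9 / 5 + 32


def within_range(ranges: dict, temp_c: float, humidity_pct: float) -> bool:
    """Return True if both readings fall inside the given ranges (inclusive)."""
    t_lo, t_hi = ranges["temp_c"]
    h_lo, h_hi = ranges["humidity_pct"]
    return t_lo <= temp_c <= t_hi and h_lo <= humidity_pct <= h_hi


if __name__ == "__main__":
    # Hypothetical sensor reading: 27 °C at 45% relative humidity.
    print(within_range(OPERATING, 27.0, 45.0))
    print(f"{c_to_f(27.0):.1f} °F")
```

The check does not model condensation. Both humidity ranges in the table assume non-condensing conditions.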

D.2. Component Specifications

Component Qty Description
CPU 1 Intel Xeon E5-2698 v4 2.2 GHz (20-Core)
GPU - current units 4 NVIDIA Tesla V100-DGXS-32GB, featuring:
  • 4×125 TeraFLOPS (500 TeraFLOPS total), FP16
  • 4×32 GB (128 GB total) GPU memory
  • 4×640 (2,560 total) NVIDIA Tensor Cores
  • 4×5,120 (20,480 total) NVIDIA CUDA® cores
GPU - earlier units 4 NVIDIA Tesla V100-DGXS-16GB, featuring:
  • 4×125 TeraFLOPS (500 TeraFLOPS total), FP16
  • 4×16 GB (64 GB total) GPU memory
  • 4×640 (2,560 total) NVIDIA Tensor Cores
  • 4×5,120 (20,480 total) NVIDIA CUDA® cores
System memory 8 8×32 GB (256 GB total) ECC Registered RDIMM DDR4 SDRAM
Note: You can replace all eight factory-installed 32-GB DIMMs with 64-GB DIMMs to give a total capacity of 512 GB.
Data storage 3 3×1.92 TB (5.76 TB total) 2.5" 6 Gb/s SATA III SSD in RAID 0 configuration
Note: Since DGX OS Desktop 4.4.0 or DGX software for Red Hat Enterprise Linux or CentOS EL7-20.02, you can add four 1.92-TB SSDs for data storage to give a total capacity of 13.44 TB in a RAID 0 configuration.
OS storage 1 1.92 TB 2.5" 6 Gb/s SATA III SSD
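
The totals quoted in the table follow from per-unit arithmetic. As a cross-check, this short Python sketch recomputes them from the quantities and per-unit values listed above. The helper name `total` is hypothetical.

```python
def total(quantity: int, per_unit: float) -> float:
    """Multiply a per-unit figure by a quantity, rounded to two decimals."""
    return round(quantity * per_unit, 2)


if __name__ == "__main__":
    # GPU figures (4 GPUs per system).
    print("FP16 TeraFLOPS:", total(4, 125))        # table: 500
    print("GPU memory, 32-GB units:", total(4, 32))  # table: 128 GB
    print("GPU memory, 16-GB units:", total(4, 16))  # table: 64 GB
    print("Tensor Cores:", total(4, 640))          # table: 2,560
    print("CUDA cores:", total(4, 5120))           # table: 20,480

    # System memory (8 DIMM slots).
    print("Memory, 32-GB DIMMs:", total(8, 32))    # table: 256 GB
    print("Memory, 64-GB DIMMs:", total(8, 64))    # note: 512 GB

    # Data storage in RAID 0: 3 factory SSDs, or 3 + 4 after expansion.
    print("Data storage, 3 SSDs (TB):", total(3, 1.92))      # table: 5.76
    print("Data storage, 7 SSDs (TB):", total(3 + 4, 1.92))  # note: 13.44
```

RAID 0 stripes data without redundancy, so the usable capacity is the plain sum of the member drives, which is why the totals are simple multiples of 1.92 TB.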

D.3. Mechanical Specifications

Specification Value
Height 25" (639 mm)
Width 10" (256 mm)
Depth 20" (518 mm)
Gross weight 88 lbs (40 kg)

D.4. Power Specifications

Input Comments
115 - 240 VAC, 12 - 8 A, 50 - 60 Hz The DGX Station power consumption can reach 1,500 W (ambient temperature 30°C) with all system resources under a heavy load.

Be aware of your electrical source’s power capability to avoid overloading the circuit.
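
To gauge whether a circuit has enough capacity, the peak power figure above can be converted to an approximate current draw. The following Python sketch is a rough estimate only. It assumes current equals power divided by voltage and ignores power factor, and the breaker rating and derating factor are hypothetical inputs that you must take from your own installation and local electrical code.

```python
PEAK_POWER_W = 1500.0  # peak consumption stated in the Power Specifications table


def current_draw_amps(power_w: float, voltage_v: float) -> float:
    """Approximate current in amperes (power / voltage, power factor ignored)."""
    return power_w / voltage_v


def fits_circuit(power_w: float, voltage_v: float, breaker_a: float,
                 derating: float = 1.0) -> bool:
    """Return True if the estimated current is within the breaker rating.

    derating is an assumed fraction of the breaker rating that you allow
    for sustained load (1.0 means the full rating).
    """
    return current_draw_amps(power_w, voltage_v) <= breaker_a * derating


if __name__ == "__main__":
    # Hypothetical example: a 120 V circuit protected by a 15 A breaker.
    amps = current_draw_amps(PEAK_POWER_W, 120.0)
    print(f"Estimated draw at 120 V: {amps:.2f} A")
    print("Fits at full rating:", fits_circuit(PEAK_POWER_W, 120.0, 15.0))
    print("Fits at 80% of rating:", fits_circuit(PEAK_POWER_W, 120.0, 15.0, derating=0.8))
```

Remember to add the draw of any other equipment that shares the circuit before comparing the total with the breaker rating.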

E. Customer Support for the NVIDIA DGX Station

Contact NVIDIA Enterprise Support for assistance in reporting, troubleshooting, or diagnosing problems with your DGX Station system.

For details on how to obtain support, visit the NVIDIA Enterprise Support web site (https://www.nvidia.com/en-us/support/enterprise/).

Our support team can help collect appropriate information about your issue and involve internal resources as needed.