Installing the DGX Software

This section requires that you have already installed Red Hat Enterprise Linux or a derived operating system on the DGX server.

Configuring a System Proxy

If your network requires the use of a proxy, then do the following:
  • Edit the file /etc/yum.conf and make sure the following lines are present in the [main] section, using the parameters that apply to your network:
    proxy=http://<Proxy-Server-IP-Address>:<Proxy-Port> 
    proxy_username=<Proxy-User-Name>
    proxy_password=<Proxy-Password>
  • Make sure that the following domains are whitelisted and that the system can access them:
    • cdn.redhat.com
    • international.download.nvidia.com
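
As an unofficial sanity check (not part of this procedure), a small helper can confirm that the proxy entry made it into the [main] section; the has_proxy function below is hypothetical, and the file path is a parameter so it can be tried on a copy before relying on /etc/yum.conf.

```shell
#!/bin/bash
# Hypothetical helper: check that a yum.conf-style file contains a
# proxy= entry inside its [main] section.
has_proxy() {
    # Print only the lines inside [main], then look for a proxy= line.
    awk '/^\[main\]/{m=1; next} /^\[/{m=0} m' "$1" | grep -q '^proxy='
}

# Example (assumes the standard path used in this guide):
# has_proxy /etc/yum.conf && echo "proxy configured" || echo "no proxy entry"
```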

Enabling the Repositories

  1. On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.
    sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
  2. Run the following commands to install the DGX software installation package and enable the NVIDIA DGX software repository.

    Attention: By running these commands you are confirming that you have read and agree to be bound by the DGX Software License Agreement. You are also confirming that you understand that any pre-release software and materials available that you elect to install in a DGX may not be fully functional, may contain errors or design flaws, and may have reduced or different security, privacy, availability, and reliability standards relative to commercial versions of NVIDIA software and materials, and that you use pre-release versions at your risk.
    1. Install the NVIDIA DGX Package for Red Hat Enterprise Linux.

      sudo yum install -y \
      https://international.download.nvidia.com/dgx/repos/rhel-files/dgx-repo-setup-19.07-2.el7.x86_64.rpm
    2. Enable the update repository.

    • Either edit /etc/yum.repos.d/nvidia-dgx-7.repo and set enabled=1 in the [nvidia-dgx-7-updates] section:
      [nvidia-dgx-7-updates] 
      name=NVIDIA DGX EL7 Updates 
      baseurl=https://international.download.nvidia.com/dgx/repos/rhel7-updates/ 
      enabled=1 
      gpgcheck=1
      gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-dgx-cosmos-support
    • Or, if you have the yum-utils package installed, issue the following.
      sudo yum-config-manager --enable nvidia-dgx-7-updates
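
To double-check the result without opening the file, a short helper can report whether a given repo id has enabled=1; this repo_enabled function is an illustrative sketch, not an NVIDIA or yum tool.

```shell
#!/bin/bash
# Hypothetical helper: report whether the named section of a .repo file
# contains enabled=1. Usage: repo_enabled <file> <repo-id>
repo_enabled() {
    awk -v id="$2" '$1 == ("[" id "]") {m=1; next} /^\[/{m=0} m' "$1" \
        | grep -q '^enabled=1'
}

# Example with the repo file installed by the dgx-repo-setup package:
# repo_enabled /etc/yum.repos.d/nvidia-dgx-7.repo nvidia-dgx-7-updates \
#     && echo "update repository enabled"
```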

Installing Required Components

Installing DGX Tools and Updating Configuration Files

  1. Install DGX tools and configuration files.
    • For DGX-1, install DGX-1 Configurations.
      sudo yum groupinstall -y 'DGX-1 Configurations'
    • For the DGX-2, install DGX-2 Configurations.
      sudo yum groupinstall -y 'DGX-2 Configurations' 
    • For the DGX Station, install DGX Station Configurations.
      sudo yum groupinstall -y 'DGX Station Configurations'

    The configuration changes take effect only after rebooting the system, which is covered in the next step.

  2. Update the kernel.
    1. Issue the following.
      $ sudo yum update
      Performing this update also updates the installed Red Hat Enterprise Linux 7 distribution to the latest version. To check the latest Red Hat Enterprise Linux 7 version, visit https://access.redhat.com/articles/3078.
    2. Reboot the server into the updated kernel.
      $ sudo reboot

Configuring the /raid Partition

The DGX servers and the DGX Station include multiple SSDs for data caching or data storage. Configure these SSDs as a RAID array in a partition mounted at /raid. For the DGX servers, these SSDs are intended to be used as a data cache for NFS mounted directories. For the DGX Station, these SSDs are intended to be used either for local persistent storage or as a data cache for NFS mounted directories.

Configuring the /raid Partition as an NFS Cache

If you are using the data SSDs for caching NFS reads, configure these SSDs as a RAID 0 array mounted at /raid, and update the cachefilesd configuration to use the /raid partition.
  1. Configure the RAID array.

    This will create the RAID group, mount it to /raid, and create an appropriate entry in /etc/fstab.

    sudo configure_raid_array.py -c -f 
    Note: The RAID array must be configured before installing dgx-conf-cachefilesd, which places the proper SELinux label on the /raid directory. If you ever need to recreate the RAID array after dgx-conf-cachefilesd has already been installed (recreating the array wipes out any labeling on /raid), be sure to restore the label manually before restarting cachefilesd.

    sudo restorecon /raid
    sudo systemctl restart cachefilesd
  2. Install dgx-conf-cachefilesd to update the cachefilesd configuration to use the /raid partition.
    sudo yum install -y dgx-conf-cachefilesd
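
As an unofficial check that the cache now lives on /raid, the cache directory directive in cachefilesd's configuration can be inspected; the cache_on_raid helper below is a hypothetical sketch, and /etc/cachefilesd.conf is the standard cachefilesd configuration path.

```shell
#!/bin/bash
# Hypothetical helper: confirm that the "dir" directive in a
# cachefilesd.conf-style file points somewhere under /raid.
cache_on_raid() {
    awk '$1 == "dir" {print $2}' "$1" | grep -q '^/raid'
}

# Example: cache_on_raid /etc/cachefilesd.conf && echo "cache is on /raid"
```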

Configuring the /raid Partition for Local Persistent Storage

If you are using the data SSDs in the DGX Station for local persistent storage, configure these SSDs as a RAID 0 or RAID 5 array, mounted at /raid.

RAID 0 provides the maximum storage capacity, but does not provide any redundancy. If a single SSD in the array fails, all data stored on the array is lost. RAID 5 provides some level of protection against failure of a single SSD but with lower storage capacity than RAID 0.

  • To configure a RAID 0 array, run the following command.

    sudo configure_raid_array.py -c -f
  • To configure a RAID 5 array, run the following command.

    sudo configure_raid_array.py -c -f -5

These commands will create the RAID group, mount it to /raid, and create an appropriate entry in /etc/fstab.
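
The resulting array can be inspected with cat /proc/mdstat. As an illustrative sketch (not an official tool), the helper below extracts the RAID level of a given md device from that output:

```shell
#!/bin/bash
# Hypothetical helper: print the RAID level of an md device from
# /proc/mdstat-style text, e.g. "md0 : active raid0 nvme3n1[1] ..."
raid_level() {
    # Usage: raid_level <md-device> < mdstat-text
    awk -v dev="$1" '$1 == dev && $3 == "active" {print $4}'
}

# Example on a live system:
# raid_level md0 < /proc/mdstat
```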

Installing and Loading the NVIDIA CUDA Drivers

  1. Install the kernel-devel package.

    The kernel-devel package provides kernel headers required for the NVIDIA CUDA driver. Use the following command to install the kernel headers for the kernel version that is currently running on the system.

    sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
  2. Ensure that the latest version of gcc is installed, as older versions may not support all of the features required to build the driver.
    sudo yum install -y gcc
  3. Install the driver package.

    This will build and install the driver kernel modules. The installation of the dkms-nvidia package can take approximately five minutes.

    sudo yum install -y cuda-drivers dgx-persistence-mode
    Note: Red Hat Enterprise Linux 7.5 ships with OpenGL libraries that conflict with versions included in the CUDA drivers. Depending on the Software Selection performed in Installing Red Hat Enterprise Linux, you might encounter an error with the following libraries: mesa-libGL, mesa-libEGL, or mesa-libGLES. Simply remove these libraries and re-issue the yum install command.
    sudo rpm -e mesa-libGL.x86_64 --nodeps
    sudo rpm -e mesa-libEGL.x86_64 --nodeps
    sudo rpm -e mesa-libGLES.x86_64 --nodeps
    sudo yum install -y cuda-drivers dgx-persistence-mode
  4. Reboot the system to load the drivers and to update system configurations.
    sudo reboot
  5. After the system has rebooted, verify that the drivers have been loaded and are handling the NVIDIA devices.
    nvidia-smi

    The output should show all available GPUs.

    Example: Output from a DGX-1 system
    +-----------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79    CUDA Version: 10.0  |
    |----------------------------+-------------------+----------------------+
    | GPU Name     Persistence-M | Bus-Id     Disp.A | Volatile Uncorr. ECC |
    | Fan Temp Perf Pwr:Usage/Cap|      Memory-Usage | GPU-Util  Compute M. |
    |============================+===================+======================|
    |   0 Tesla V100-SXM2...  On | ...00:06:00.0 Off |                    0 |
    | N/A  33C   P0   45W / 300W |   0MiB / 32480MiB |      0%      Default |
    +----------------------------+-------------------+----------------------+
    |   1 Tesla V100-SXM2...  On | ...00:07:00.0 Off |                    0 |
    | N/A  35C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
    +----------------------------+-------------------+----------------------+
    :                            :                   :                      :
    +----------------------------+-------------------+----------------------+
    |   7 Tesla V100-SXM2...  On | ...00:8A:00.0 Off |                    0 |
    | N/A  34C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
    +----------------------------+-------------------+----------------------+
    +-----------------------------------------------------------------------+
    | Processes:                                                 GPU Memory |
    |  GPU       PID   Type   Process name                       Usage      |
    |=======================================================================|
    |  No running processes found                                           |
    +-----------------------------------------------------------------------+
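
A quick unattended check is to count the GPUs that the driver enumerates and compare against what the platform should have (8 for a DGX-1, 16 for a DGX-2, 4 for a DGX Station). The gpu_count helper below is an illustrative sketch that counts lines of nvidia-smi -L output:

```shell
#!/bin/bash
# Hypothetical helper: count GPUs in `nvidia-smi -L` output, where each
# GPU is listed on its own line beginning with "GPU ".
gpu_count() {
    grep -c '^GPU '
}

# Example on a live DGX-1 system:
# [ "$(nvidia-smi -L | gpu_count)" -eq 8 ] || echo "unexpected GPU count"
```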

Installing the NVIDIA Container Runtime

  1. Install Docker 1.13 from the rhel-7-server-extras-rpms repository.
    sudo yum install -y docker
  2. Install the NVIDIA Container Runtime group.
    sudo yum groupinstall -y 'NVIDIA Container Runtime'
  3. Run the following command to verify the installation.
    sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi

    See the section Running Containers for more information about this command. For a description of nvcr.io, see the NGC Registry Spaces documentation.

    To ensure that Docker can access the NGC container registry through a proxy, refer to the Red Hat customer portal knowledge base article Configure Docker to use a proxy with or without authentication.

    The output should show all available GPUs.

    +-----------------------------------------------------------------------+
    | NVIDIA-SMI 410.79       Driver Version: 410.79    CUDA Version: 10.0  |
    |----------------------------+-------------------+----------------------+
    | GPU Name     Persistence-M | Bus-Id     Disp.A | Volatile Uncorr. ECC |
    | Fan Temp Perf Pwr:Usage/Cap|      Memory-Usage | GPU-Util  Compute M. |
    |============================+===================+======================|
    |   0 Tesla V100-SXM2...  On | ...00:06:00.0 Off |                    0 |
    | N/A  33C   P0   45W / 300W |   0MiB / 32480MiB |      0%      Default |
    +----------------------------+-------------------+----------------------+
    |   1 Tesla V100-SXM2...  On | ...00:07:00.0 Off |                    0 |
    | N/A  35C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
    +----------------------------+-------------------+----------------------+
    :                            :                   :                      :
    +----------------------------+-------------------+----------------------+
    |   7 Tesla V100-SXM2...  On | ...00:8A:00.0 Off |                    0 |
    | N/A  34C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
    +----------------------------+-------------------+----------------------+
    +-----------------------------------------------------------------------+
    | Processes:                                                 GPU Memory |
    |  GPU       PID   Type   Process name                       Usage      |
    |=======================================================================|
    |  No running processes found                                           |
    +-----------------------------------------------------------------------+

Installing Diagnostic Components

NVIDIA System Management (NVSM) provides the diagnostic components for NVIDIA DGX systems. NVSM is a software framework for monitoring NVIDIA DGX server nodes in a data center. It includes active health monitoring, system alerts, and log generation. The NVSM CLI can also be used for checking the health of and obtaining diagnostic information for DGX Station workstations.
Note: The diagnostic components for NVIDIA DGX systems require Python 3. It is available from the Red Hat Enterprise Linux Software Collections. The Fedora EPEL repository also contains a version of Python 3; however, this combination has not been tested.
  1. Enable the Red Hat Software Collections repository.
    sudo subscription-manager repos --enable=rhel-server-rhscl-7-rpms

    If you do not have access to the Red Hat Software Collections repository, refer to https://access.redhat.com/solutions/472793 for instructions on requesting access for free.

  2. Install Python 3.6.
    sudo yum install -y rh-python36
    Important: The diagnostic components for NVIDIA DGX systems are not supported with the python3 package. Be sure to install only the rh-python36 package.
  3. Install the DGX System Management package group.
    sudo yum groupinstall -y 'DGX System Management'
For information about using NVSM, see the NVIDIA System Management documentation.

Replicating the EFI System Partition on DGX-2

This section applies only to the NVIDIA DGX-2.
Once the 'DGX System Management' group is installed, the 'nvsm' tool can be used to replicate the EFI system partition (ESP) onto the second M.2 drive.
Important: Run these steps ONLY IF
  • You are installing Red Hat Enterprise Linux on the NVIDIA DGX-2, and
  • You installed Red Hat Enterprise Linux on the RAID 1 array per instructions in the section Installing on DGX-2.
  1. Start the NVSM tool.
    sudo nvsm
  2. Navigate to /systems/localhost/storage/volumes/md0.
    nvsm-> cd /systems/localhost/storage/volumes/md0
  3. Start the rebuild process.
    nvsm(/systems/localhost/storage/volumes/md0)-> start rebuild
    1. At the first prompt, specify the second M.2 disk.
      PROMPT: In order to rebuild this volume, a spare drive
              is required. Please specify the spare drive to
              use to rebuild md0.
      Name of spare drive for md0 rebuild (CTRL-C to cancel): nvme1n1
      This should be the M.2 disk on which you did NOT install the ESP. If you followed the instructions in the section Installing on DGX-2, this should be nvme1n1.
    2. At the second prompt, confirm that you want to proceed.
      WARNING: Once the volume rebuild process is started, the
               process cannot be stopped.
      Start RAID-1 rebuild on md0? [y/n] y
      Upon successful completion, the following message should appear indicating that the ESP has been replicated:
      /systems/localhost/storage/volumes/md0/rebuild started at 2019-03-07 14:40:55.844542
      RAID-1 rebuild exit status: ESP_REBUILT
      If necessary, the RAID 1 array is rebuilt after the ESP has been replicated.
      Finished rebuilding RAID-1 on volume md0
      100.0% [=========================================]
      Status: Done

Installing Optional Components

The DGX is fully functional after installing the components as described in Installing Required Components. If you intend to launch NGC containers (which incorporate the CUDA toolkit, NCCL, cuDNN, and TensorRT) on the DGX system, which is the expected use case, then you can skip this section.
If you intend to use your DGX system as a development system for running deep learning applications on bare metal, then install the optional components as described in this section.
  1. Install the CUDA toolkit.
    sudo yum install cuda
  2. Install the NVIDIA Collectives Communication Library (NCCL) Runtime.
    sudo yum groupinstall 'NVIDIA Collectives Communication Library Runtime'
  3. Install the CUDA Deep Neural Networks (cuDNN) Library Runtime.
    sudo yum groupinstall 'CUDA Deep Neural Networks Library Runtime'
  4. Install NVIDIA TensorRT.
    sudo yum install tensorrt
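
After the toolkit is installed, nvcc --version reports the CUDA release (nvcc is typically under /usr/local/cuda/bin and may need to be added to your PATH). The cuda_release helper below is a hypothetical sketch that extracts the release number from that output, assuming the usual "release X.Y" wording:

```shell
#!/bin/bash
# Hypothetical helper: pull the release number out of `nvcc --version`
# output lines such as "Cuda compilation tools, release 10.0, V10.0.130".
cuda_release() {
    sed -n 's/.*release \([0-9.]*\),.*/\1/p'
}

# Example on a live system:
# nvcc --version | cuda_release
```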

Applying an NVIDIA Look and Feel to the Desktop User Interface

If the GNOME Desktop is installed, you can optionally apply an NVIDIA look and feel to the desktop user interface by applying the NVIDIA theme to applications and the shell, and using NVIDIA images for the desktop background and lock screen.
The GNOME Desktop must already be installed and running on your system. If SOFTWARE SELECTION was set to Server with GUI when you installed Red Hat Enterprise Linux, the GNOME Desktop is already installed. If the GNOME Desktop is not installed, you must install the X Window System and GNOME package groups.
  1. Install the DGX Desktop Theme package group.
    sudo yum groupinstall -y 'DGX Desktop Theme'
  2. Start gnome-tweaks.
  3. In the Appearance window that opens, under Tweaks, click Extensions.
  4. In the Extensions window that opens, set Extensions in the title bar and User themes to ON.
  5. Stop and restart gnome-tweaks.
  6. In the Appearance window that opens, apply the NVIDIA theme to applications and the shell, and use NVIDIA images for the desktop background and lock screen.
    1. Under Themes, in the drop-down lists for Applications and Shell, click Nvidia.
    2. Under Background and Lock Screen, click the Image file selector.
    3. In the Image window that opens, select an NVIDIA DGX Station background image, for example, NVIDIA_DGX_Station_Background_B.JPG, and click Open.

Managing CPU Mitigations

DGX Software for Red Hat Enterprise Linux includes security updates to mitigate CPU speculative side-channel vulnerabilities. These mitigations can decrease the performance of deep learning and machine learning workloads.

If your installation of DGX systems incorporates other measures to mitigate these vulnerabilities, such as measures at the cluster level, you can disable the CPU mitigations for individual DGX nodes and thereby increase performance. This capability is available starting with DGX Software for Red Hat Enterprise Linux version EL7-20.02.

Determining the CPU Mitigation State of the DGX System

If you do not know whether CPU mitigations are enabled or disabled, issue the following.

$ cat /sys/devices/system/cpu/vulnerabilities/* 
  • CPU mitigations are enabled if the output consists of multiple lines prefixed with Mitigation:.

    Example

    KVM: Mitigation: Split huge pages
    Mitigation: PTE Inversion; VMX: conditional cache flushes, SMT vulnerable
    Mitigation: Clear CPU buffers; SMT vulnerable
    Mitigation: PTI
    Mitigation: Speculative Store Bypass disabled via prctl and seccomp
    Mitigation: usercopy/swapgs barriers and __user pointer sanitization
    Mitigation: Full generic retpoline, IBPB: conditional, IBRS_FW, STIBP: conditional, RSB filling
    Mitigation: Clear CPU buffers; SMT vulnerable
    
  • CPU mitigations are disabled if the output consists of multiple lines prefixed with Vulnerable.

    Example

    KVM: Vulnerable
    Mitigation: PTE Inversion; VMX: vulnerable
    Vulnerable; SMT vulnerable
    Vulnerable
    Vulnerable
    Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
    Vulnerable, IBPB: disabled, STIBP: disabled
    Vulnerable
    
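The two cases above can be distinguished mechanically: lines beginning with Vulnerable mean mitigations are off. The mitigation_state helper below is an illustrative sketch that classifies the concatenated vulnerability files on that basis:

```shell
#!/bin/bash
# Hypothetical helper: classify mitigation state from the contents of
# /sys/devices/system/cpu/vulnerabilities/*. Any line beginning with
# "Vulnerable" indicates that mitigations are disabled.
mitigation_state() {
    if grep -q '^Vulnerable'; then
        echo disabled
    else
        echo enabled
    fi
}

# Example on a live system:
# cat /sys/devices/system/cpu/vulnerabilities/* | mitigation_state
```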

Disabling CPU Mitigations

CAUTION:
Performing the following instructions will disable the CPU mitigations provided by the DGX Software for Red Hat Enterprise Linux.
  1. Apply the dgx*-no-mitigations profile.
    • On a DGX-2 system, issue
      $ sudo tuned-adm profile dgx2-no-mitigations
    • On a DGX-1 system, issue
      $ sudo tuned-adm profile dgx-no-mitigations
    • On a DGX Station workstation, issue
      $ sudo tuned-adm profile dgxstation-no-mitigations
  2. Reboot the system.
  3. Verify CPU mitigations are disabled.
    $ cat /sys/devices/system/cpu/vulnerabilities/*
    The output should include several Vulnerable lines. See Determining the CPU Mitigation State of the DGX System for example output.

Re-enabling CPU Mitigations

  1. Apply the dgx*-performance profile.
    • On a DGX-2 system, issue
      $ sudo tuned-adm profile dgx2-performance
    • On a DGX-1 system, issue
      $ sudo tuned-adm profile dgx-performance
    • On a DGX Station workstation, issue
      $ sudo tuned-adm profile dgxstation-performance
  2. Reboot the system.
  3. Verify CPU mitigations are enabled.
    $ cat /sys/devices/system/cpu/vulnerabilities/*
    The output should include several Mitigation: lines. See Determining the CPU Mitigation State of the DGX System for example output.