Installing the DGX Software

This section assumes that you have already installed Red Hat Enterprise Linux or a derived operating system on the DGX server.

Configuring a System Proxy

If your network requires the use of a proxy, edit /etc/yum.conf and make sure the following lines are present in the [main] section, using the parameters that apply to your network:

proxy=http://<Proxy-Server-IP-Address>:<Proxy-Port> 
proxy_username=<Proxy-User-Name>
proxy_password=<Proxy-Password>
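
For example, a completed [main] section might look like the following; the proxy address, port, and credentials shown here are placeholders for illustration only:

[main]
proxy=http://proxy.example.com:3128
proxy_username=dgxuser
proxy_password=dgxpassword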

Enabling the Repositories

  1. On Red Hat Enterprise Linux, run the following commands to enable additional repositories required by the DGX software.
    sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
  2. Enable the DGX software repository.

    Instructions for enabling the DGX software repository are provided in the document DGX-Software-Stack-for-Red-Hat-Enterprise-Linux-on-DGX (available to DGX customers with an NVIDIA Enterprise Support account).
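
To confirm that the required repositories are enabled before proceeding, you can list the currently enabled repositories:

sudo subscription-manager repos --list-enabled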

Installing Required Components

  1. Install DGX tools and configuration files.
    • For DGX-1, install DGX-1 Configurations.
      sudo yum groupinstall -y 'DGX-1 Configurations'
    • For the DGX-2, install DGX-2 Configurations.
      sudo yum groupinstall -y 'DGX-2 Configurations' 

    The configuration changes take effect only after rebooting the system. To minimize extra reboots, defer rebooting until after the drivers have been installed.

  2. Configure the /raid partition for use as a data cache for NFS mounted directories. The DGX servers use a RAID 0 array, mounted at /raid, for caching NFS reads.
    1. Configure the RAID array.

      This will create the RAID group, mount it to /raid, and create an appropriate entry in /etc/fstab.

      sudo configure_raid_array.py -c -f 
      Note:

      The RAID array must be configured before installing dgx-conf-cachefilesd, which places the proper SELinux label on the /raid directory. If you ever need to recreate the RAID array after dgx-conf-cachefilesd has already been installed, note that recreating the array wipes out any labeling on /raid; restore the label manually before restarting cachefilesd:

      sudo restorecon /raid
      sudo systemctl restart cachefilesd
    2. Install dgx-conf-cachefilesd to update the cachefilesd configuration to use the /raid partition.
      sudo yum install -y dgx-conf-cachefilesd
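
      Optionally, confirm that the array is assembled and mounted and that cachefilesd is running before moving on. These checks use standard tools and do not modify the system:

      cat /proc/mdstat
      grep /raid /etc/fstab
      sudo systemctl status cachefilesd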
  3. Install the NVIDIA CUDA drivers.
    1. Install the kernel-devel package.

      The kernel-devel package provides kernel headers required for the NVIDIA CUDA driver. Use the following command to install the kernel headers for the kernel version that is currently running on the system.

      sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
    2. Install the cuda-drivers package.

      This will build and install the driver kernel modules. The installation of the dkms-nvidia package can take approximately five minutes.

      sudo yum install -y cuda-drivers dgx-persistence-mode
      Note: Red Hat Enterprise Linux 7.5 ships with OpenGL libraries that conflict with the versions included in the CUDA drivers. Depending on the Software Selection performed in Installing Red Hat Enterprise Linux, you might encounter an error with the following libraries: mesa-libGL, mesa-libEGL, or mesa-libGLES. Remove the conflicting libraries and re-issue the yum install command:
      sudo rpm -e mesa-libGL.x86_64 --nodeps
      sudo rpm -e mesa-libEGL.x86_64 --nodeps
      sudo rpm -e mesa-libGLES.x86_64 --nodeps
      sudo yum install -y cuda-drivers cuda-drivers-diagnostic dgx-persistence-mode
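
      Before rebooting, you can optionally confirm that DKMS has built the NVIDIA kernel modules for the running kernel. This assumes the dkms utility was installed along with the dkms-nvidia package:

      dkms status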
  4. Reboot the system to load the drivers and to update the system configuration.
    1. Issue the reboot command.
      sudo reboot
    2. After the server has rebooted, verify that the drivers have been loaded and are handling the NVIDIA devices.
      nvidia-smi

      The output should show all available GPUs.

      Example: Output from a DGX-1 system
      +-----------------------------------------------------------------------+
      | NVIDIA-SMI 410.79       Driver Version: 410.79    CUDA Version: 10.0  |
      |----------------------------+-------------------+----------------------+
      | GPU Name     Persistence-M | Bus-Id     Disp.A | Volatile Uncorr. ECC |
      | Fan Temp Perf Pwr:Usage/Cap|      Memory-Usage | GPU-Util  Compute M. |
      |============================+===================+======================|
      |   0 Tesla V100-SXM2...  On | ...00:06:00.0 Off |                    0 |
      | N/A  33C   P0   45W / 300W |   0MiB / 32480MiB |      0%      Default |
      +----------------------------+-------------------+----------------------+
      |   1 Tesla V100-SXM2...  On | ...00:07:00.0 Off |                    0 |
      | N/A  35C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
      +----------------------------+-------------------+----------------------+
      :                            :                   :                      :
      +----------------------------+-------------------+----------------------+
      |   7 Tesla V100-SXM2...  On | ...00:8A:00.0 Off |                    0 |
      | N/A  34C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
      +----------------------------+-------------------+----------------------+
      +-----------------------------------------------------------------------+
      | Processes:                                                 GPU Memory |
      |  GPU       PID   Type   Process name                       Usage      |
      |=======================================================================|
      |  No running processes found                                           |
      +-----------------------------------------------------------------------+
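
      You can also confirm that persistence mode is enabled on each GPU, which the dgx-persistence-mode package is intended to configure at boot. The query fields below are standard nvidia-smi options:

      nvidia-smi --query-gpu=index,name,persistence_mode --format=csv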
  5. Install the NVIDIA Container Runtime.
    1. Install Docker 1.13 from the rhel-7-server-extras-rpms repository.
      sudo yum install -y docker
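
      Installing the docker package does not normally start the Docker service. If it is not already enabled and running, start it before continuing:

      sudo systemctl enable docker
      sudo systemctl start docker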
    2. Install the NVIDIA Container Runtime group.
      sudo yum groupinstall -y 'NVIDIA Container Runtime'
    3. Run the following command to verify the installation.
      sudo docker run --security-opt label=type:nvidia_container_t --rm nvcr.io/nvidia/cuda nvidia-smi

      See the section Running Containers for more information about this command. For a description of nvcr.io, see the NGC Registry Spaces documentation.

      To ensure that Docker can access the NGC container registry through a proxy, refer to the Red Hat customer portal knowledge base article Configure Docker to use a proxy with or without authentication; a minimal example of such a proxy drop-in appears after the expected output below.

      The output should show all available GPUs.

      +-----------------------------------------------------------------------+
      | NVIDIA-SMI 410.79       Driver Version: 410.79    CUDA Version: 10.0  |
      |----------------------------+-------------------+----------------------+
      | GPU Name     Persistence-M | Bus-Id     Disp.A | Volatile Uncorr. ECC |
      | Fan Temp Perf Pwr:Usage/Cap|      Memory-Usage | GPU-Util  Compute M. |
      |============================+===================+======================|
      |   0 Tesla V100-SXM2...  On | ...00:06:00.0 Off |                    0 |
      | N/A  33C   P0   45W / 300W |   0MiB / 32480MiB |      0%      Default |
      +----------------------------+-------------------+----------------------+
      |   1 Tesla V100-SXM2...  On | ...00:07:00.0 Off |                    0 |
      | N/A  35C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
      +----------------------------+-------------------+----------------------+
      :                            :                   :                      :
      +----------------------------+-------------------+----------------------+
      |   7 Tesla V100-SXM2...  On | ...00:8A:00.0 Off |                    0 |
      | N/A  34C   P0   44W / 300W |   0MiB / 32480MiB |      0%      Default |
      +----------------------------+-------------------+----------------------+
      +-----------------------------------------------------------------------+
      | Processes:                                                 GPU Memory |
      |  GPU       PID   Type   Process name                       Usage      |
      |=======================================================================|
      |  No running processes found                                           |
      +-----------------------------------------------------------------------+
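
      If your environment requires the Docker proxy configuration mentioned above, a minimal sketch is to create the file /etc/systemd/system/docker.service.d/http-proxy.conf with contents similar to the following (the proxy address is a placeholder), then reload systemd and restart Docker:

      [Service]
      Environment="HTTP_PROXY=http://proxy.example.com:3128"
      Environment="HTTPS_PROXY=http://proxy.example.com:3128"

      sudo systemctl daemon-reload
      sudo systemctl restart docker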

Installing Diagnostic Components

NVIDIA System Management (NVSM) is a software framework for monitoring NVIDIA DGX nodes in a data center. It includes active health monitoring, system alerts, and log generation.
Note: The NVIDIA System Management tools require Python 3, which is available from the Red Hat Enterprise Linux Software Collections. The Fedora EPEL repository also contains a version of Python 3; however, that combination has not been tested.
  1. Enable the Red Hat Software Collections repository.
    sudo subscription-manager repos --enable=rhel-server-rhscl-7-rpms

    If you do not have access to the Red Hat Software Collections repository, refer to https://access.redhat.com/solutions/472793 for instructions on requesting access for free.

  2. Install Python 3.6.
    sudo yum install -y rh-python36
    Important: NVSM is not supported with the python3 package. Be sure to install only the rh-python36 package.
  3. Install the DGX System Management tools, which include the NVSM tool.
    sudo yum groupinstall -y 'DGX System Management'
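
A quick way to confirm that NVSM is installed and working is to run a basic health query; show health is one of the NVSM command-line subcommands:

sudo nvsm show health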

Replicating the EFI System Partition on DGX-2

This section applies only to the NVIDIA DGX-2.
Once the 'DGX System Management' group is installed, the 'nvsm' tool can be used to replicate the EFI system partition (ESP) onto the second M.2 drive.
Important: Run these steps ONLY IF
  • You are installing Red Hat Enterprise Linux on the NVIDIA DGX-2, and
  • You installed Red Hat Enterprise Linux on the RAID 1 array per instructions in the section Installing on DGX-2.
  1. Start the NVSM tool.
    sudo nvsm
  2. Navigate to /systems/localhost/storage/volumes/md0.
    nvsm-> cd /systems/localhost/storage/volumes/md0
  3. Start the rebuild process.
    nvsm(/systems/localhost/storage/volumes/md0)-> start rebuild
    1. At the first prompt, specify the second M.2 disk.
      PROMPT: In order to rebuild this volume, a spare drive
              is required. Please specify the spare drive to
              use to rebuild md0.
      Name of spare drive for md0 rebuild (CTRL-C to cancel): nvme1n1
      This should be the M.2 disk on which you did NOT install the ESP. If you followed the instructions in the section Installing on DGX-2, this is nvme1n1.
    2. At the second prompt, confirm that you want to proceed.
      WARNING: Once the volume rebuild process is started, the
               process cannot be stopped.
      Start RAID-1 rebuild on md0? [y/n] y
      Upon successful completion, the following message should appear indicating that the ESP has been replicated:
      /systems/localhost/storage/volumes/md0/rebuild started at 2019-03-07 14:40:55.844542
      RAID-1 rebuild exit status: ESP_REBUILT
      If necessary, the RAID 1 array is rebuilt after the ESP has been replicated.
      Finished rebuilding RAID-1 on volume md0
      100.0% [=========================================]
      Status: Done
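
After the rebuild reports Done, you can confirm the state of the array from the operating system; /proc/mdstat is the kernel's standard view of software RAID volumes:

cat /proc/mdstat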

Installing Optional Components

The DGX is fully functional after installing the components as described in Installing Required Components. If you intend to launch NGC containers (which incorporate the CUDA toolkit, NCCL, cuDNN, and TensorRT) on the DGX system, which is the expected use case, then you can skip this section.
If you intend to use your DGX system as a development system for running deep learning applications on bare metal, then install the optional components as described in this section.
  1. Install the CUDA toolkit.
    sudo yum install cuda
  2. Install the NVIDIA Collectives Communication Library (NCCL) Runtime.
    sudo yum groupinstall 'NVIDIA Collectives Communication Library Runtime'
  3. Install the CUDA Deep Neural Networks (cuDNN) Library Runtime.
    sudo yum groupinstall 'CUDA Deep Neural Networks Library Runtime'
  4. Install NVIDIA TensorRT.
    sudo yum install tensorrt
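
To confirm that the toolkit is usable for bare-metal development, check the CUDA compiler version. This assumes the cuda package installs the toolkit under /usr/local/cuda, its default location:

/usr/local/cuda/bin/nvcc --version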