DGX Software for Red Hat Enterprise Linux 7 Release Notes

This document describes the key features, software enhancements and improvements, and known issues for the NVIDIA DGX Software for Red Hat Enterprise Linux 7.

1. DGX Software for Red Hat Enterprise Linux 7 Overview

NVIDIA provides an NVIDIA® DGX™ software stack for installation on DGX systems running Red Hat Enterprise Linux. The software stack provides the same features and functionality as the original DGX OS server software, which is built upon the Ubuntu operating system. See also the DGX Software on Red Hat Enterprise Linux 7 Installation Guide.

2. Version EL7-18.11

DGX Software for Red Hat Enterprise Linux 7, version EL7-18.11, is available.

Software Contents

The following table provides version information for software included in the DGX Software Stack for Red Hat Enterprise Linux 7 and Red Hat-derived operating systems.

Note: Unlike the DGX OS shipped with the NVIDIA DGX-1, the DGX software stack for Red Hat-derived operating systems does not include the Mellanox OpenFabrics Enterprise Distribution (MLNX_OFED) for Linux. It is omitted because the MLNX_OFED kernel is likely to be out of sync with the Red Hat distribution kernel, which can result in system instability. To use InfiniBand on the DGX-1, see the DGX Software for Red Hat Enterprise Linux 7 Installation Guide for instructions.
Component                               Version
DGX Software                            EL7-18.11
GPU Driver                              410.79
NVIDIA System Health Monitor (NVSM)
  nvsm-cli                              18.10.6-1.el7.x86_64
  nvsm-dshm                             18.12-2.el7.noarch
  nvsm-apis                             18.10.11-1.el7.x86_64
  nvsysinfo                             18.10.5-1.el7.x86_64
  nvhealth                              18.10.10-1.el7.x86_64
Data Center GPU Management (DCGM)       1.5.3-1
NCCL Runtime                            2.3.7-1
cuDNN Library Runtime                   7.3.1.20-1
TensorRT                                5.0.2.6-1
CUDA Toolkit                            10.0
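
The versions listed above can be compared against what is installed on a system. The following is a minimal sketch, not an NVIDIA-provided tool: the helper name check_version is hypothetical, and the usage lines assume that the NVSM entries in the table are RPM package names and that nvidia-smi and rpm are on the PATH.

```shell
# check_version COMPONENT EXPECTED ACTUAL
# Prints OK when the installed version matches the release notes,
# otherwise prints MISMATCH and returns a non-zero status.
# (Hypothetical helper for illustration only.)
check_version() {
    component=$1
    expected=$2
    actual=$3
    if [ "$actual" = "$expected" ]; then
        echo "OK $component $actual"
    else
        echo "MISMATCH $component expected=$expected actual=$actual"
        return 1
    fi
}

# Usage on a DGX system (package name nvsm-cli is an assumption):
# check_version "GPU Driver" 410.79 \
#     "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
# check_version nvsm-cli 18.10.6-1.el7.x86_64 \
#     "$(rpm -q --queryformat '%{VERSION}-%{RELEASE}.%{ARCH}' nvsm-cli)"
```

Passing the actual version in as an argument keeps the comparison separate from the query, so the same helper works for the driver, the RPM packages, or any other component in the table.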

Compatibility

NVIDIA has validated and tested the DGX Software version EL7-18.11 on the NVIDIA DGX-1 (Tesla V100) with Red Hat Enterprise Linux 7.5.

2.1. Known Issues

2.1.1. Black Screen on BMC Remote Console with Red Hat Enterprise Linux 7.5

Issue

After installing Red Hat Enterprise Linux 7.5 and booting to the command line, the video output might display only a black screen and not show any regular characters (only bold or colored characters might be printed).

Workaround

Provide the additional ast.modeset=0 option to the kernel as follows.

  1. Boot the system, select Install Red Hat Enterprise Linux 7.5 from the GRUB menu, and then press ‘e’ to edit the boot command.

  2. Move the cursor down to the boot command line and add ast.modeset=0 anywhere after the Linux boot image name (“linuxefi /vmlinuz-<version>”).

  3. Press Ctrl-x to boot the kernel with the modified setting.

    All characters should now be visible during the boot process and terminal log-in.

Until you complete the installation of the “DGX Configurations” software group, you will need to perform these steps any time you reboot the system. After installing the “DGX Configurations” software group, the software adds the modeset setting permanently and you no longer need to perform the steps manually.
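
To confirm that the option is active after a boot, the running kernel's command line can be inspected. The following is a minimal sketch under stated assumptions: the helper name has_kernel_param is hypothetical, and /proc/cmdline is the standard Linux location of the boot command line.

```shell
# has_kernel_param PARAM CMDLINE
# Succeeds if PARAM appears as a whole, space-separated word in CMDLINE.
# (Hypothetical helper for illustration only.)
has_kernel_param() {
    param=$1
    cmdline=$2
    # Pad with spaces so that only exact matches of the full word succeed.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a running system:
# if has_kernel_param ast.modeset=0 "$(cat /proc/cmdline)"; then
#     echo "ast.modeset=0 is active"
# else
#     echo "ast.modeset=0 is missing; repeat the workaround steps"
# fi
```

Matching on whole words avoids a false positive from a longer value such as ast.modeset=01.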

2.1.2. NVSM CLI Returns HTTP Code 500 Error After Hot-Plugging a Previously Removed SSD

Issue

After removing one of the cache SSDs from the DGX-1, checking the status using NVSM CLI, and then hot-plugging the SSD back in, NVSM CLI reports an HTTP code 500 error.

Example, where drive 20:4 is the reinserted SSD (20 is the enclosure ID and 4 is the drive slot):

nvsm-> show /systems/localhost/storage/drives/20:4
/systems/localhost/storage/drives/20:4
ERROR:nvsm:Bad HTTP status code "500" from NVSM backend: Internal Server Error 

Explanation and Workaround

After the SSD is reinserted into the system, NVSM recognizes the drive but fails to get full device information from storCLI. Additionally, the RAID controller sets the array to offline and marks the reinserted SSD as Unconfigured_Bad (UBad). This prevents the RAID 0 array from being recreated.

To correct this condition:

  1. Set the drive back to a good state.
    $ sudo /opt/MegaRAID/storcli/storcli64 /c0/e<enclosure_id>/s<drive_slot> set good force
  2. Run the script to recreate the array.
    $ sudo configure_raid_array.py -c -f
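
The <enclosure_id> and <drive_slot> placeholders in step 1 correspond to the two halves of the drive ID that NVSM reports, such as 20:4 in the example above. The following is a minimal sketch of that mapping; the helper name storcli_drive_path is hypothetical, and controller 0 (/c0) is taken from the command in step 1.

```shell
# storcli_drive_path DRIVE_ID
# Converts an NVSM drive ID of the form <enclosure_id>:<drive_slot>
# into the storcli object path on controller 0.
# (Hypothetical helper for illustration only.)
storcli_drive_path() {
    drive_id=$1
    case "$drive_id" in
        *:*) ;;
        *)
            echo "expected <enclosure_id>:<drive_slot>, got '$drive_id'" >&2
            return 1
            ;;
    esac
    enclosure=${drive_id%%:*}   # text before the first colon
    slot=${drive_id#*:}         # text after the first colon
    echo "/c0/e${enclosure}/s${slot}"
}

# Usage with the drive from the example above:
# sudo /opt/MegaRAID/storcli/storcli64 "$(storcli_drive_path 20:4)" set good force
```

For drive 20:4 the helper prints /c0/e20/s4, which is the path to pass to storcli64 in step 1.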

Notices

Notice

THE INFORMATION IN THIS GUIDE AND ALL OTHER INFORMATION CONTAINED IN NVIDIA DOCUMENTATION REFERENCED IN THIS GUIDE IS PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE INFORMATION FOR THE PRODUCT, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the product described in this guide shall be limited in accordance with the NVIDIA terms and conditions of sale for the product.

THE NVIDIA PRODUCT DESCRIBED IN THIS GUIDE IS NOT FAULT TOLERANT AND IS NOT DESIGNED, MANUFACTURED OR INTENDED FOR USE IN CONNECTION WITH THE DESIGN, CONSTRUCTION, MAINTENANCE, AND/OR OPERATION OF ANY SYSTEM WHERE THE USE OR A FAILURE OF SUCH SYSTEM COULD RESULT IN A SITUATION THAT THREATENS THE SAFETY OF HUMAN LIFE OR SEVERE PHYSICAL HARM OR PROPERTY DAMAGE (INCLUDING, FOR EXAMPLE, USE IN CONNECTION WITH ANY NUCLEAR, AVIONICS, LIFE SUPPORT OR OTHER LIFE CRITICAL APPLICATION). NVIDIA EXPRESSLY DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY OF FITNESS FOR SUCH HIGH RISK USES. NVIDIA SHALL NOT BE LIABLE TO CUSTOMER OR ANY THIRD PARTY, IN WHOLE OR IN PART, FOR ANY CLAIMS OR DAMAGES ARISING FROM SUCH HIGH RISK USES.

NVIDIA makes no representation or warranty that the product described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this guide. NVIDIA does not accept any liability related to any default, damage, costs or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this guide, or (ii) customer product designs.

Other than the right for customer to use the information in this guide with the product, no other license, either expressed or implied, is hereby granted by NVIDIA under this guide. Reproduction of information in this guide is permissible only if reproduction is approved by NVIDIA in writing, is reproduced without alteration, and is accompanied by all associated conditions, limitations, and notices.

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, and DGX Station are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.