Known Issues: DGX-1

Refer to the section for each software version to see which of these issues are open in that version.

DKMS May not Build for New Kernel During Driver Update


When updating the driver, the DKMS module may not build for a newly installed kernel, resulting in a driver/library mismatch. This can be confirmed by the following output when issuing nvidia-smi:

Failed to initialize NVML: Driver/library version mismatch


Initiate a DKMS build manually by issuing the following:

$ sudo dkms install nvidia/418.67 -k $(uname -r) 
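The check and the manual build above can be combined into a small script. The following is only a sketch: the function names are hypothetical (not part of any NVIDIA tool), and the driver version defaults to the 418.67 shown above, which you should replace with the version actually installed on your system.

```shell
# Sketch only: helper names are hypothetical, not part of the DGX software.

# Returns 0 (true) if the given text contains the NVML mismatch error.
has_version_mismatch() {
    case "$1" in
        *"Driver/library version mismatch"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Runs nvidia-smi and starts a DKMS build only when the mismatch is reported.
# $1: driver version (assumption: defaults to 418.67, as used in this section).
rebuild_if_mismatched() {
    local driver_version="${1:-418.67}"
    local smi_output
    smi_output="$(nvidia-smi 2>&1)"
    if has_version_mismatch "$smi_output"; then
        sudo dkms install "nvidia/${driver_version}" -k "$(uname -r)"
    else
        echo "No driver/library mismatch reported; nothing to do."
    fi
}
```

To use it, source the functions and run rebuild_if_mismatched, optionally passing the driver version as the first argument.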

Black Screen on BMC Remote Console with Red Hat Enterprise Linux 7.5


After installing Red Hat Enterprise Linux 7.5 and booting to the command line, the video output might display only a black screen and not show any regular characters (only bold or colored characters might be printed).


Provide the additional ast.modeset=0 option to the kernel as follows.

  1. Boot the system, then select Install Red Hat Enterprise Linux 7.5 from the grub menu and then press ‘e’ to edit the boot command.

  2. Move the cursor down to the boot command line and add ast.modeset=0 anywhere after the Linux boot image name (“linuxefi /vmlinuz-<version>”).

  3. Press Ctrl-x to boot the kernel with the modified setting.

    All characters should now be visible during the boot process and terminal log-in.

Until you complete the installation of the “DGX Configurations” software group, you will need to perform these steps any time you reboot the system. After installing the “DGX Configurations” software group, the software adds the modeset setting permanently and you no longer need to perform the steps manually.
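To confirm whether the running kernel was booted with the option, you can inspect the kernel command line in /proc/cmdline. The following is a minimal sketch; the function names are hypothetical and chosen only for illustration.

```shell
# Sketch only: helper names are hypothetical.

# Returns 0 (true) if the kernel command line passed as $1 contains the
# exact argument ast.modeset=0 (surrounded by spaces or line boundaries).
has_ast_modeset_disabled() {
    case " $1 " in
        *" ast.modeset=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Checks the command line of the currently running kernel.
check_running_kernel() {
    if has_ast_modeset_disabled "$(cat /proc/cmdline)"; then
        echo "ast.modeset=0 is set for this boot."
    else
        echo "ast.modeset=0 is NOT set; repeat the manual steps on next boot."
    fi
}
```

Running check_running_kernel after a reboot shows whether the manual steps are still needed or whether the setting has been made permanent.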

NVSM CLI Returns HTTP Code 500 Error After Hot-Plugging a Previously Removed SSD


After removing one of the cache SSDs from the DGX-1, checking the status using NVSM CLI, and then hot-plugging the SSD back in, NVSM CLI reports an HTTP code 500 error.

Example, where drive 20:4 is the reinserted SSD (20 is the enclosure ID and 4 is the drive slot):

nvsm-> show /systems/localhost/storage/drives/20:4
ERROR:nvsm:Bad HTTP status code "500" from NVSM backend: Internal Server Error 

Explanation and Workaround

After the SSD is re-inserted into the system, NVSM recognizes the drive but fails to get full device information from storCLI. Additionally, the RAID controller sets the array to offline and marks the re-inserted SSD as Unconfigured_Bad (UBad). This prevents the RAID 0 array from being recreated.

To correct this condition:

  1. Set the drive back to a good state.
    # sudo /opt/MegaRAID/storcli/storcli64 /c0/e<enclosure_id>/s<drive_slot> set good force
  2. Run the script to recreate the array.
     # sudo -c -f  
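For step 1, the enclosure ID and drive slot can be taken directly from the NVSM drive ID (for example, 20:4 means enclosure 20, slot 4). The sketch below builds the storcli device path from such an ID. The function names are hypothetical, and the controller index /c0 is assumed from the command shown in step 1.

```shell
# Sketch only: helper names are hypothetical.

# Converts an NVSM drive ID such as "20:4" into a storcli device path
# such as /c0/e20/s4 (controller 0 is assumed, as in step 1).
drive_id_to_storcli_path() {
    local drive_id="$1"
    local enclosure_id="${drive_id%%:*}"   # text before the colon
    local drive_slot="${drive_id##*:}"     # text after the colon
    echo "/c0/e${enclosure_id}/s${drive_slot}"
}

# Sets the drive identified by an NVSM drive ID back to a good state.
set_drive_good() {
    sudo /opt/MegaRAID/storcli/storcli64 "$(drive_id_to_storcli_path "$1")" set good force
}
```

For the drive in the example above, set_drive_good 20:4 runs the step 1 command against /c0/e20/s4.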

DGX-1: NVSM Storage Alerts are Cleared After Removing All Four RAID 0 Data Drives


When data drives are removed, NVSM raises several alerts, including a controller alert. After the last drive is removed, however, the controller alert is cleared.


NVIDIA is currently investigating this issue.

DGX-1: DSHM Does Not Clear Alerts After RAID 0 Data Drives are Recreated


The alert raised when a RAID 0 data drive is removed (visible, for example, in the output of the "nvsm show alerts" command) is not cleared after replacing the drive, recreating the RAID 0 array, and then rebooting the system.


To clear the alerts, run the following command:

# systemctl restart nvsm-storage-dshm