Known Limitations

This section lists known limitations and other issues that will not be fixed.

Unable to Boot from Degraded RAID 1 Array

Issue

If the second partition of the OS RAID 1 array is deleted, which puts the array into degraded mode, the system cannot boot.

Explanation and Workaround

This occurs with Red Hat Enterprise Linux 7 or CentOS 7. Instead of booting normally, the OS boots into emergency mode.

To recover manually, run the following commands while in emergency mode to enter maintenance mode.

mdadm --run /dev/md0
exit

While in maintenance mode, recover the array by adding the lost RAID partition back.

mdadm /dev/md0 --add /dev/nvme1n1p2
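
Before and after adding the partition back, it can help to confirm whether the array is actually degraded. The following is a minimal sketch, not part of the documented workaround. It assumes the usual /proc/mdstat status format, in which a missing member appears as an underscore inside the status brackets (for example, [U_]). The function name md_is_degraded is hypothetical.

```shell
# Hypothetical helper: succeeds if the mdstat text on stdin shows a
# degraded array, that is, a member-status field such as [U_] or [_U].
md_is_degraded() {
    grep -Eq '\[[U_]*_[U_]*\]'
}

# Usage on a live system (assumes /proc/mdstat is readable).
if [ -r /proc/mdstat ] && md_is_degraded < /proc/mdstat; then
    echo "at least one md array is degraded"
fi
```

For a per-array view, mdadm --detail /dev/md0 reports the array state and lists its member devices.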

NGC Containers Might Not Run

Issue

NGC containers might not run without either

  • using the --privileged argument, or

  • disabling SELinux

Explanation and Workaround

NVIDIA device files do not always receive the correct SELinux label after boot. To work around this issue, run the following command before starting the NGC container.

sudo restorecon /dev/nvidia*
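
The relabeling step can also be wrapped so that it reports when no device files are present. This is a minimal sketch under stated assumptions: the function name relabel_nvidia_devices and its optional directory argument are hypothetical, and it assumes restorecon and sudo are installed.

```shell
# Hypothetical helper: relabel the NVIDIA device files under a directory
# (default: /dev). Assumes restorecon and sudo are available.
relabel_nvidia_devices() {
    dev_dir="${1:-/dev}"
    found=0
    for node in "$dev_dir"/nvidia*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$node" ] || continue
        found=1
        # -v prints each file whose label restorecon changes.
        sudo restorecon -v "$node"
    done
    if [ "$found" -eq 0 ]; then
        echo "no NVIDIA device files found in $dev_dir"
        return 1
    fi
}
```

Calling relabel_nvidia_devices with no argument operates on /dev. Running ls -Z /dev/nvidia* before and after shows the labels that are in effect.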

DGX-1: NVSM Storage Alerts are Cleared After Removing All Four RAID 0 Data Drives

Issue

When data drives are removed, NVSM raises several alerts, including a controller alert. However, after the last drive is removed, the controller alert is cleared.

Status

This is not a typical or likely use case.

RHEL7’s Version of Docker Does Not Support DLFW Containers 23.05 and Later

Issue

The version of Docker provided in RHEL7, 1.13.1, does not support Deep Learning Framework containers 23.05 or newer.

Explanation and Workaround

The clone3 syscall used in these containers is not supported in this version of Docker. The recommended workaround is to use DLFW containers 23.04 or older. Alternatively, install docker-ce 20.10.14 or newer by following the instructions here.
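
To see which side of the 20.10.14 threshold a system falls on, the installed Docker version can be compared against it. This is a minimal sketch under stated assumptions: the function name version_ge is hypothetical, the comparison relies on GNU sort -V, and the usage part assumes the docker CLI is on the PATH and the daemon is running.

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to $2.
# Assumes GNU sort, whose -V option orders version strings numerically.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a live system (skipped if docker is not installed).
if command -v docker >/dev/null 2>&1; then
    installed="$(docker version --format '{{.Server.Version}}' 2>/dev/null || true)"
    if version_ge "$installed" "20.10.14"; then
        echo "Docker $installed meets the 20.10.14 minimum"
    else
        echo "Docker version is older than 20.10.14 or could not be determined"
    fi
fi
```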