See the following sections for the known issues; issues fixed in a given release are noted accordingly.
[DGX-1, DGX-2]: DCGM Update Error
When attempting to update DCGM, the following error message may be displayed:
There was an internal error during the test: 'Couldn't find the ubergemm executable which is required; the install may have failed.'
This is a known issue and is anticipated to be fixed in the next release.
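Before retrying the update, it can help to confirm which DCGM version is currently installed. A minimal sketch using the standard dcgmi CLI (the guard handles machines where DCGM is not installed):

```shell
# Query the installed DCGM version via the dcgmi CLI; print a notice
# instead on systems where DCGM is not present.
if command -v dcgmi >/dev/null 2>&1; then
    dcgmi --version
else
    echo "dcgmi not found: DCGM does not appear to be installed"
fi
```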
[DGX-1, DGX-2]: nsys Fails to Launch
After installing a different package version from the CUDA network repository, nsys fails to launch.
More specifically, when installing the CUDA Toolkit, several nsight-systems package versions are available, and the most recent one, nsight-systems-2022.1.3-2022.1.3.3_1c7b5f7-0.x86_64.rpm, is installed by default.
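The package versions the configured repositories offer can be listed before pinning one. A sketch assuming dnf is available and the CUDA repository is configured:

```shell
# List every nsight-systems version the configured repos provide, so a
# specific build can be chosen instead of the default latest one.
if command -v dnf >/dev/null 2>&1; then
    dnf list --showduplicates 'nsight-systems*' 2>/dev/null \
        || echo "no nsight-systems packages found in the configured repos"
else
    echo "dnf not available on this system"
fi
```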
To work around the issue, download and install the following package from the CUDA repository; this restores the expected nsys path:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/nsight-systems-2021.3.2-2021.3.2.4_027534f-0.x86_64.rpm
$ sudo dnf install ./nsight-systems-2021.3.2-2021.3.2.4_027534f-0.x86_64.rpm
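After installing the pinned package, a quick sanity check confirms that nsys resolves on the PATH and launches. A sketch with a guard for machines where nsys is absent:

```shell
# Confirm nsys is on the PATH and reports its version; print a notice
# otherwise so the check is safe to run anywhere.
if command -v nsys >/dev/null 2>&1; then
    echo "nsys resolves to: $(command -v nsys)"
    nsys --version
else
    echo "nsys not found on PATH"
fi
```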
[DGX-1, DGX-2]: nvsm dump health Does Not Generate sosreport
[Fixed in EL8-21.08]
When running nvsm dump health, the log file reports:
INFO: Could not find sosreport output file
Analysis of the log files reveals that information is missing for components that are installed on the system; such as InfiniBand cards.
The sosreport is not collected; this is resolved in EL8-21.08.
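On affected releases, the missing data can still be gathered by running sosreport directly. This is a sketch, assuming the sos package is installed (--batch runs sosreport non-interactively):

```shell
# Collect a sosreport manually as an interim step on affected releases.
# --batch skips the interactive prompts; requires root privileges.
if command -v sosreport >/dev/null 2>&1; then
    sudo sosreport --batch
else
    echo "sosreport not found: install the sos package first"
fi
```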
[DGX-2]: No Rebuild Function for RAID 0 if Volume is not md1
[Fixed in EL8-21.08]
If the RAID 0 volume has a designation other than md1, such as md128, then no rebuild function is available for it. For example:
$ nvsm show /systems/localhost/storage/volumes/md128
CapacityBytes = 61462699992678
Encrypted = False
Id = md128
Drives = [ nvme10n1, nvme11n1, nvme12n1, nvme13n1, nvme14n1, nvme15n1, nvme16n1, nvme17n1, nvme2n1, nvme3n1, nvme4n1, nvme5n1, nvme6n1, nvme7n1, nvme8n1, nvme9n1 ]
Name = md128
Status_Health = OK
Status_State = Enabled
VolumeType = RAID-0
This occurs when the drives deviate from the default configuration of two OS drives in RAID 1 and the storage drives used for caching in RAID 0. The nvsm rebuild command does not support non-standard drive configurations.
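The md volume names on a system can be checked before deciding whether nvsm rebuild applies; the kernel exposes them in /proc/mdstat. A minimal sketch:

```shell
# List the md RAID volume names the kernel reports; nvsm rebuild only
# supports the default layout where the cache volume is md1.
if [ -r /proc/mdstat ]; then
    grep '^md' /proc/mdstat || echo "no md volumes listed in /proc/mdstat"
else
    echo "/proc/mdstat not present (no md RAID support loaded)"
fi
```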
[DGX-2]: Storage Alerts Persist from Previous RAID Configuration
[Fixed in EL8-21.08]
After switching to a custom drive configuration, such as by adding or removing storage drives, NVSM storage alerts from the previous configuration are still reported even though the current drive status is healthy.
For example, alerts can appear for md0 even though the current RAID volume is named md125 and its status is healthy.
To configure NVSM to support custom drive partitioning, perform the following steps.
Edit /etc/nvsm/nvsm.config and set the “use_standard_config_storage” parameter to false.
Remove the NVSM database.
$ sudo rm /var/lib/nvsm/sqlite/nvsm.db
Restart NVSM.
$ sudo systemctl restart nvsm
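After the restart, the active alerts can be reviewed to confirm the stale storage entries are gone. A sketch, assuming the NVSM CLI's show alerts subcommand is available (it only exists on DGX systems, so the guard makes the check safe elsewhere):

```shell
# Verify that the stale storage alerts no longer appear after NVSM is
# restarted with the rebuilt database; requires root on a DGX system.
if command -v nvsm >/dev/null 2>&1; then
    sudo nvsm show alerts
else
    echo "nvsm not found: run this check on the DGX system itself"
fi
```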