Known Issues

See the version-specific sections to determine which issues are open in each version.

[DGX-1, DGX-2]: DCGM Update Error

Issue

When attempting to update DCGM, the following error message may be displayed:

There was an internal error during the test: 'Couldn't find the ubergemm executable which is required; the install may have failed.'
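
For illustration, one sequence that can surface this message (assuming DCGM is installed from the CUDA repository as the datacenter-gpu-manager package, and that the error is reported by the DCGM diagnostics) is:
$ sudo dnf update datacenter-gpu-manager
$ dcgmi diag -r 1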

Explanation

This is a known issue and is anticipated to be fixed in the next release.

[DGX-1, DGX-2]: GPU MIG Partitions do not return output fields

Issue

When MIG is enabled and a MIG partition has been created for the GPU, no output is returned for non-device-specific fields when running `dcgmi dmon -e 1,2,3,4,5`.
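
For illustration, the following sketch reproduces the symptom on a MIG-capable GPU; the GPU index and the 1g.5gb instance profile are examples and may differ on your system:
$ sudo nvidia-smi -i 0 -mig 1               # enable MIG mode on GPU 0
$ sudo nvidia-smi mig -i 0 -cgi 1g.5gb -C   # create a GPU instance and its compute instance (example profile)
$ dcgmi dmon -e 1,2,3,4,5                   # non-device-specific fields return no output on affected versions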

Explanation

This issue affects EL8 systems with:
  • Driver Version: 470.141.03
  • CUDA Version: 11.4.152
  • DCGM: 2.4.5

[DGX-1, DGX-2]: Log displays CEC error

Issue

The DGX A100 Firmware Update Container log may show error messages such as "Unable to send RAW command (channel=0x0 netfn=0x3c lun=0x0 cmd=0xf rsp=0xd3): Destination unavailable".

Explanation

This error will be displayed when running supported commands and may be safely ignored.

[DGX-1]: NVSM show controllers SerialNumber shows `NOT_SET`

Issue

After rebooting, `nvsm show controllers` may display a blank serial number.

Explanation

This issue is specific to the DGX-1 platform with the MegaRAID controller and can be remedied by restarting the nvsm service approximately 30 minutes after the reboot. To restart the service, run `systemctl restart nvsm`.
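
For example, after the wait period the remediation and verification amount to:
$ systemctl restart nvsm
$ nvsm show controllers     # the SerialNumber field should now be populated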

[DGX-1, DGX-2]: nsys fails to launch

Issue

After installing a different package version from the CUDA network repository, `nsys` fails to launch.

More specifically, the CUDA Toolkit provides several `nsight-systems` package versions, and the most recent one, `nsight-systems-2022.1.3-2022.1.3.3_1c7b5f7-0.x86_64.rpm`, is installed by default.
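
To see which nsight-systems package versions the configured repositories provide, a query along these lines can be used (the glob pattern is illustrative):
$ dnf list --showduplicates 'nsight-systems*'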

Workaround

Download and install the following package from the CUDA repository, which resolves the path issue:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/nsight-systems-2021.3.2-2021.3.2.4_027534f-0.x86_64.rpm
$ sudo dnf install ./nsight-systems-2021.3.2-2021.3.2.4_027534f-0.x86_64.rpm 
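
After installation, a quick way to confirm that nsys launches is to query its version:
$ nsys --version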

[DGX-1, DGX-2]: nvsm dump health Does not Generate sosreport

Issue

[Fixed in EL8-21.08]

After running `nvsm dump health`, the log file reports:

INFO: Could not find sosreport output file

Analysis of the log files reveals that information is missing for components that are installed on the system, such as InfiniBand cards.

Explanation

The sosreport is not being collected. This will be resolved in a later software release.
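
On releases prior to EL8-21.08, the report can be generated manually as an interim measure, assuming the sos package is installed (this is a sketch of a manual collection, not part of the nvsm tooling):
$ sudo sosreport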

[DGX-2]: No Rebuild Function for RAID 0 if Volume is not md1

Issue

[Fixed in EL8-21.08]

If the RAID 0 volume has a designation other than md1, such as md128, then no rebuild target is listed under Targets.

Example:

$ nvsm show /systems/localhost/storage/volumes/md128
Properties:
    CapacityBytes = 61462699992678
    Encrypted = False
    Id = md128
    Drives = [ nvme10n1, nvme11n1, nvme12n1, nvme13n1, nvme14n1, nvme15n1, nvme16n1, nvme17n1, nvme2n1, nvme3n1, nvme4n1, nvme5n1, nvme6n1, nvme7n1, nvme8n1, nvme9n1 ]
    Name = md128
    Status_Health = OK
    Status_State = Enabled
    VolumeType = RAID-0
Targets:
    encryption    
Verbs:
    cd
    show

Explanation

This occurs when the drives are not in the default configuration: two OS drives in RAID 1 and the storage drives used for caching in RAID 0. The nvsm rebuild command does not support non-standard drive configurations.
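
To review the current volume layout before attempting a rebuild, the array membership can be inspected with standard tools and with NVSM (the volumes path follows the example above):
$ cat /proc/mdstat
$ nvsm show /systems/localhost/storage/volumes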

[DGX-2]: Storage Alerts Persist from Previous RAID Configuration

Issue

[Fixed in EL8-21.08]

After switching to a custom drive configuration, such as by adding or removing storage drives, any NVSM storage alerts from the previous configuration will still be reported even though the current drive status is healthy.

For example, alerts can appear for md0 even though the current RAID drive name is md125 and healthy.

Explanation

Alerts from the previous drive configuration are retained in the NVSM database and continue to be reported until the database is removed and NVSM is configured for custom drive partitioning.

Workaround

To configure NVSM to support custom drive partitioning, perform the following steps. A verification check is shown after the procedure.
  1. Edit /etc/nvsm/nvsm.config and set the "use_standard_config_storage" parameter to false.
    "use_standard_config_storage":false
  2. Remove the NVSM database.
     $ sudo rm /var/lib/nvsm/sqlite/nvsm.db 
  3. Restart NVSM.
    $ systemctl restart nvsm
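
After NVSM restarts and rebuilds its database, the stale storage alerts should no longer be reported. This can be verified, for example, with the following command (assuming the alerts target is available in your NVSM version):
$ nvsm show alerts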