Fixed Issues
The following sections describe the fixed issues.
Issue
When attempting to update DCGM, the following error message may be displayed: There was an internal error during the test: 'Couldn't find the ubergemm executable which is required; the install may have failed.'
Explanation
This is a known issue and is anticipated to be fixed in the next release.
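Because the error message suggests the install may have failed, one way to narrow down the problem is to confirm that the DCGM package itself installed cleanly. This is only a sketch: the package name datacenter-gpu-manager is an assumption based on the usual EL8 DCGM packaging, and dcgmi must already be on the PATH.
# Assumed DCGM package name on EL8; adjust if your repository names it differently.
$ dnf list installed datacenter-gpu-manager
# Report the installed DCGM version (assumes dcgmi is on the PATH).
$ dcgmi --version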
Issue
When attempting to install a different package version from the CUDA network repository, `nsys` will not launch.
More specifically, when installing the CUDA Toolkit, several `nsight-systems` packages with different versions are available, and the most recent one, `nsight-systems-2022.1.3-2022.1.3.3_1c7b5f7-0.x86_64.rpm`, is installed by default.
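To see which `nsight-systems` versions the repository provides and which one is currently installed, you can query the package manager. This is a minimal sketch for an EL8 system; the exact package names depend on the repositories you have configured.
# List all nsight-systems versions available from the configured repositories.
$ dnf list --showduplicates 'nsight-systems*'
# Show the version that is currently installed.
$ rpm -qa | grep nsight-systems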
Workaround
Download and install the following package from the CUDA repository, which resolves the path and fixes the issue:
$ wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/nsight-systems-2021.3.2-2021.3.2.4_027534f-0.x86_64.rpm
$ sudo dnf install ./nsight-systems-2021.3.2-2021.3.2.4_027534f-0.x86_64.rpm
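One way to confirm that the workaround took effect is to check that `nsys` now starts and reports its version (a verification step, not part of the documented workaround):
# nsys should launch and print its version after the downgrade.
$ nsys --version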
Issue
[Fixed in EL8-21.08]
After running `nvsm dump health`, the log file reports `INFO: Could not find sosreport output file`.
Analysis of the log files reveals that information is missing for components that are installed on the system, such as InfiniBand cards.
Explanation
The sosreport is not being collected. This will be resolved in a later software release.
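Until the fix is available, the missing component information can be gathered by running sosreport directly. This is only a sketch: on EL8 the `sos` package provides the `sosreport` command, and the exact plugin set depends on the installed sos version.
# Collect a sosreport manually so InfiniBand and other component data is captured.
$ sudo sosreport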
Issue
[Fixed in EL8-21.08]
If the RAID 0 volumes have a designation other than `md1`, such as `md128`, then there is no Rebuild entry under Targets.
Example:
$ nvsm show /systems/localhost/storage/volumes/md128
Properties:
CapacityBytes = 61462699992678
Encrypted = False
Id = md128
Drives = [ nvme10n1, nvme11n1, nvme12n1, nvme13n1, nvme14n1, nvme15n1, nvme16n1, nvme17n1, nvme2n1, nvme3n1, nvme4n1, nvme5n1, nvme6n1, nvme7n1, nvme8n1, nvme9n1 ]
Name = md128
Status_Health = OK
Status_State = Enabled
VolumeType = RAID-0
Targets:
encryption
Verbs:
cd
show
Explanation
This occurs if the drives are not in the default configuration: two OS drives in a RAID 1 array and the storage drives used for caching in a RAID 0 array. The `nvsm rebuild` command does not support non-standard drive configurations.
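For comparison, in the default configuration the RAID 0 cache volume is named md1, and the rebuild entry is expected to appear in the Targets section of the corresponding volume. The volume path below assumes the default md1 name:
# In the default configuration, the RAID 0 cache volume is md1.
$ nvsm show /systems/localhost/storage/volumes/md1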
Issue
[Fixed in EL8-21.08] After switching to a custom drive configuration, such as by adding or removing storage drives, any NVSM storage alerts from the previous configuration will still be reported even though the current drive status is healthy.
For example, alerts can appear for `md0` even though the current RAID volume is `md125` and is healthy.
Workaround
To configure NVSM to support custom drive partitioning, perform the following steps.
1. Edit `/etc/nvsm/nvsm.config` and set the "use_standard_config_storage" parameter to false:
   "use_standard_config_storage": false
2. Remove the NVSM database:
   $ sudo rm /var/lib/nvsm/sqlite/nvsm.db
3. Restart NVSM:
   $ systemctl restart nvsm
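After NVSM restarts, you can check that the stale alerts are no longer reported. This verification step is not part of the documented procedure and assumes `nvsm show alerts` is available for listing active alerts:
# Confirm NVSM restarted cleanly.
$ systemctl status nvsm
# List the currently active NVSM alerts; stale md0 alerts should no longer appear.
$ sudo nvsm show alerts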