Known Issues
This section provides a list of the known issues.
After a tray swap, the partition shows a maximum of 72 GPUs and the nv CLI shows 0.0.0.0 for replaced GPUs.
Workaround: After replacing the tray (old tray with a new one), run the GetGpuInfoList() NMX-C API. The API output:
Includes GPUs that are currently in the domain, including GPUs that were added as part of the new tray.
Excludes GPUs that were removed from the domain (with location 0.0.0.0).
API call example:
grpcurl -plaintext -d '{"context": {}, "loc": {}, "gatewayId": "myGateway"}' 127.0.0.1:9371 nmx_c.NMX_Controller/GetGpuInfoList
FundamentalResetExitCount is zero after Function Level Reset (FLR).
FundamentalResetExitCount is not updated correctly when a Function Level Reset (FLR) occurs. The counter remains 0 because the logic that records the reset exit runs only in non-FLR reset paths. This is a logging issue only; FLR resets work as expected.
Workaround: There is currently no workaround, and this issue will be resolved in a future VBIOS release.
CPU core soft lockup.
The system may experience a higher rate of CPU core soft lockups during OS runtime on this firmware version.
Workaround: 1.3.6 has been downgraded to SBIOS 2.05.06. This issue will be fixed in a future release.
MC power cycle of compute and switch fails to bring up Fabric Manager.
In rare cases, following an unexpected power failure or a switch node crash, Fabric Manager may fail to start due to database (Fabric Manager resource information) corruption.
Workaround: There is currently no method to recover the corrupted data. To restore functionality, perform an NMX-C reset to reinitialize the cluster and Fabric Manager.
Segmentation faults occur when running some applications.
During process teardown, applications may encounter segmentation faults under some circumstances.
Workaround: There is currently no workaround, and this issue will be fixed in TRD5.
In rare instances (< 0.01%), the BMC management interface may fail to train the PHY link.
In rare instances (< 0.01%), the BMC management interface PHY may not link correctly with a switch.
Workaround: Reboot the BMC through the host via IPMI (ipmitool mc reset cold) or through Redfish via the BMC-to-host USB Ethernet interface:
curl -k -u "user:password" -X POST https://${bmcip}/redfish/v1/Managers/BMC_0/Actions/Manager.Reset -d '{"ResetType": "ForceRestart"}'
Unexpected data (PCB sensor) was detected at “/redfish/v1/Chassis/HGX_Chassis_0/Sensors”.
The PCB-Temp sensor was renamed to Exhaust-Temp. Although the PCB sensor no longer exists in the Redfish sensor list, the PCB temp events still exist and increment whenever an exhaust temp event occurs.
Workaround: There is currently no workaround, and this issue will be fixed in a future release.
When updating VBIOS, percent complete still shows 0 after the task is complete.
In some cases, the task service indicates a success condition while the progress property remains at ‘0’. In this case, ignore the progress property and monitor the TaskState and TaskStatus to determine update status.
Workaround: There is currently no workaround, and this issue will be fixed in a future release.
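The monitoring advice above can be sketched as a small shell helper. This is a minimal sketch, not the update tooling itself: task_outcome is a hypothetical function name, and the mapping of TaskState values to outcomes (treating Exception, Killed, and Cancelled as failures) is an assumption based on the standard Redfish Task schema rather than on this firmware's documentation.

```shell
# Classify a Redfish update task from TaskState and TaskStatus only.
# PercentComplete is deliberately ignored because it may remain at 0.
# Usage: task_outcome <TaskState> <TaskStatus>
task_outcome() {
    state=$1
    status=$2
    case "$state" in
        Completed)
            # A completed task counts as a success only when its status is OK.
            if [ "$status" = "OK" ]; then
                echo "success"
            else
                echo "failed"
            fi
            ;;
        Exception|Killed|Cancelled)
            echo "failed"
            ;;
        *)
            # New, Starting, Running, Pending, and similar states.
            echo "in-progress"
            ;;
    esac
}
```

For example, task_outcome Completed OK prints success, while task_outcome Running OK prints in-progress regardless of the reported progress value.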
Grace needs to clear the dpc_trigger_status.
In hotplug scenarios that involve CX8 downstream devices that enable DPC, some DPC events that are triggered by the hotplug attempts are not correctly handled by RAS firmware. This causes the DPC trigger status to remain set after a hotplug DPC event, which results in the downstream link remaining disabled.
Workaround: Disable DPC for CX8 downstream links before you attempt to hotplug the devices. This issue will be fixed in the next release.
NeighborMTUDiscards property is not defined in MetricReportDefinitions.
NeighborMTUDiscards property is not defined in MetricReportDefinitions but exists in the Metric Reports.
Workaround: Do not use this property; reading it will fail. The property will be removed in a future release.
When an incorrect static topology file (for example 2x36) is specified on a 1x72 topology, FM log displays duplicate control plane error states.
When an incorrect static topology file (for example 2x36) is specified on a 1x72 topology, both CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED and CONFIG_ERROR_MISSING_CHASSIS are set.
Workaround: There is currently no workaround, and this issue will be fixed in a future NVOS release.
GPU Status state missing from metric reports.
Status/Status URI is missing from /redfish/v1/TelemetryService/MetricReports/HGX_ProcessorMetrics_0.
Workaround: Status can be read from telemetry service processor metrics (/redfish/v1/Systems/HGX_Baseboard_0/Processors/GPU_{GpuId}#/Status/State). This issue will be fixed in a future release.
Disable access link retraining.
The GPU Access Link Retraining feature is disabled in this release. After upgrading, a compute tray reboot or GPU reset is required. With this release, a compute tray reboot or GPU reset is also required to bring GPU NVLink back online when a switch tray is rebooted or power-cycled for any reason. This feature will be re-enabled in a future NVOS release.
Inconsistent GPU memory reported by NSM, nvidia-smi, and Redfish.
Multiple management interfaces report different total GPU memory values: NSM Type 3 (0x0C) and nvidia-smi (-q -d MEMORY) return 284,208 MiB, while Redfish TotalMemorySizeMiB returns 285,324 MiB. This occurs when querying inventory or memory capacity from NSM, nvidia-smi, MODS, and Redfish.
This reporting inconsistency may cause confusion in monitoring, inventory reconciliation, or capacity planning. No data-loss or security impact has been identified.
Workaround: There is currently no workaround, and this issue will be fixed in a future release.
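To make the size of the discrepancy concrete, the sketch below computes the difference between the two reported values and shows one way an inventory script might tolerate it. The function names and the 2048 MiB default tolerance are illustrative assumptions, not values from NVIDIA.

```shell
# Absolute difference between two memory sizes in MiB.
mem_delta_mib() {
    delta=$(( $1 - $2 ))
    if [ "$delta" -lt 0 ]; then
        delta=$(( 0 - delta ))
    fi
    echo "$delta"
}

# Succeed when two reported sizes differ by no more than a tolerance.
# The 2048 MiB default is a hypothetical threshold chosen for illustration.
mem_values_match() {
    tolerance=${3:-2048}
    [ "$(mem_delta_mib "$1" "$2")" -le "$tolerance" ]
}

# Values from the issue description: Redfish vs. NSM / nvidia-smi.
mem_delta_mib 285324 284208   # prints 1116
```

The two values differ by 1,116 MiB, which is roughly 0.4% of the Redfish figure, so a reconciliation check with a small absolute tolerance treats them as the same device capacity.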
BMC PCIe link reset causes SBIOS exception.
When the PCIe link between the Grace CPU and BMC is reset at runtime, the system might take a fatal exception, which generates a Fatal CPER event. This issue only impacts the NVIDIA reference design but might impact partners.
Workaround: Customers should avoid ungraceful PCIe link resets to the BMC while systems are operational. Partners that do not route PCIe to the BMC are not impacted by this issue.
No mechanism to disable host IPMI interface.
This firmware release does not provide a mechanism to disable the host IPMI interface, so any privileged user running on a compute tray can send IPMI commands to the BMC on the same compute tray.
Workaround: A future BMC firmware release will provide an interface to restrict host IPMI commands. In the meantime, necessary access controls need to be implemented on the compute trays to limit host privileged access by users not authorized to access the BMC.
Fan Control and Leak Detector user configured settings that are modified through Redfish PATCH API will reset to default after BMC firmware update.
Some Fan Control and Leak Detector properties are configurable through the PATCH method on Redfish. For example, the user can use this API to set whether the BMC shuts down the chassis when a leak is detected:
curl -s -k -u ${USER}:${PASSWORD} https://${BMCIP}/redfish/v1/Chassis/Chassis_0/Oem/Nvidia/Policies/LeakDetectionPolicy --request PATCH -d '{"PolicyEnabled":true}'
This setting will persist through BMC resets and tray power cycles, but will not currently survive a BMC firmware update.
Workaround: Check and reapply any desired settings after a BMC firmware update.
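One way to make the reapply step repeatable is to keep the desired settings in a file and replay them after each BMC firmware update. The sketch below assumes a hypothetical settings file with one "resource path, space, JSON body" entry per line, and a hypothetical CURL variable that allows the HTTP client to be swapped out for a dry run; USER, PASSWORD, and BMCIP are the same variables used in the example above.

```shell
# Reapply saved Redfish PATCH settings after a BMC firmware update.
# Each line of the settings file is: <resource path> <JSON body>
# (this file format is an assumption made for this sketch).
reapply_settings() {
    settings_file=$1
    while read -r resource body || [ -n "$resource" ]; do
        # Skip blank lines.
        [ -n "$resource" ] || continue
        ${CURL:-curl} -s -k -u "${USER}:${PASSWORD}" \
            "https://${BMCIP}${resource}" --request PATCH -d "$body" || return 1
    done < "$settings_file"
}
```

A settings file for the leak detection example above would contain the single line /redfish/v1/Chassis/Chassis_0/Oem/Nvidia/Policies/LeakDetectionPolicy {"PolicyEnabled":true}. Reading each resource back with a GET after the PATCH is a reasonable extra check that the value took effect.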
The Redfish Firmware Inventory API intermittently fails to fetch certain firmware endpoints.
On rare occasions, the response from the following Redfish API, which retrieves the current FirmwareInventory, is missing some firmware inventory endpoints:
curl -s -k -u ${USER}:${PASSWORD} https://${BMCIP}/redfish/v1/UpdateService/FirmwareInventory
The cause has been identified and the fix will be included in a future BMC release.
Workaround: Re-run the same Redfish API.
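For scripted use, the re-run can be automated. The sketch below repeats a fetch command until the collection reports at least an expected number of members. It assumes the operator knows the expected count from a known-good baseline and that the response carries the standard Redfish Members@odata.count property; the function names and the RETRY_DELAY variable are hypothetical.

```shell
# Print the Members@odata.count value from a Redfish collection on stdin.
inventory_count() {
    sed -n 's/.*"Members@odata.count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Re-run a fetch command until the collection reports at least the expected
# number of members, or until the attempts run out.
# Usage: fetch_complete_inventory <expected> <attempts> <command...>
fetch_complete_inventory() {
    expected=$1
    attempts=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        body=$("$@") || body=""
        count=$(printf '%s' "$body" | inventory_count)
        if [ -n "$count" ] && [ "$count" -ge "$expected" ]; then
            printf '%s\n' "$body"
            return 0
        fi
        n=$(( n + 1 ))
        sleep "${RETRY_DELAY:-2}"
    done
    return 1
}
```

A call such as fetch_complete_inventory 20 5 curl -s -k -u "${USER}:${PASSWORD}" "https://${BMCIP}/redfish/v1/UpdateService/FirmwareInventory" would retry up to five times; the value 20 is only a placeholder for the count observed on a healthy system.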
Liteon PowerShelf FW Update Task Monitoring Does Not Fully Track Progress for PMC and PSU Firmware.
During Liteon PMC firmware updates performed in Immediate mode, the update process does not report task progress. While the update is in progress, the BMC becomes temporarily inaccessible, which prevents accurate monitoring.
A similar limitation exists during PSU firmware updates, where task monitoring is not fully reliable due to an indexing issue (a shift from 0–5 to 1–6) in PMC FW 1.3.10.
The firmware updates themselves complete successfully for both PSU and PMC firmware. The issue is limited to task monitoring, which can affect users who rely on automation or scripted checks. This limitation is known in Liteon PMC firmware versions 1.3.10 and 1.3.9.
Workaround: There is currently no definitive workaround. Users must wait approximately 10 minutes for the PMC to reboot and the BMC to become accessible again after initiating the update. The task monitoring issue is expected to be resolved in the upcoming Liteon PMC firmware release 1.3.11.
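Automation that cannot rely on task progress can instead wait for the BMC to answer again. The following is a minimal sketch of such a wait loop: wait_for_bmc is a hypothetical helper, the probe command is supplied by the caller, and the timeout and interval values shown in the usage note are assumptions chosen to leave margin over the roughly 10-minute reboot.

```shell
# Poll a probe command until it succeeds or a timeout expires.
# Usage: wait_for_bmc <timeout seconds> <interval seconds> <probe command...>
# The interval must be greater than zero.
wait_for_bmc() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while [ "$elapsed" -le "$timeout_s" ]; do
        # The probe's output is discarded; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval_s"
        elapsed=$(( elapsed + interval_s ))
    done
    return 1
}
```

For example, wait_for_bmc 900 15 curl -s -k -f -o /dev/null "https://${BMCIP}/redfish/v1/" polls the Redfish service root every 15 seconds for up to 15 minutes, assuming BMCIP points at the BMC that becomes inaccessible during the update. Once the call returns successfully, the firmware version can be read back to confirm the update.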
[5770595] Leak detector voltage values read through the compute tray BMC via ipmi are clamped to 0.165 and 1.815.
Workaround: To read the correct voltage value when the sensor is outside that range, read the sensor value through Redfish under Chassis_0/Sensors.
Customers are always advised to use the standard Leak Detection methods under /redfish/v1/Chassis/Chassis_0/ThermalSubsystem/LeakDetection/LeakDetectors to determine whether a leak is present, rather than reading the voltage value of the sensor directly.
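Scripts that still collect the voltage through IPMI can at least flag readings that sit at a clamp limit, since those may not reflect the real sensor value. The sketch below is illustrative only: is_clamped is a hypothetical helper, and it treats any reading at or beyond 0.165 or 1.815 as clamped.

```shell
# Succeed if an IPMI leak detector voltage reading is at a clamp limit
# (0.165 or 1.815), meaning the true value may lie outside that range.
# Usage: is_clamped <reading>
is_clamped() {
    awk -v v="$1" 'BEGIN { exit !(v <= 0.165 || v >= 1.815) }'
}
```

When is_clamped succeeds for a reading, the value should be re-read through Redfish as described above; a reading such as 1.000 is inside the range and can be used as reported.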