Improvements

This section describes the improvements in each release.

Release 1.3.6

  1. MSE Uptime crash ~67 days (XID 150)

    After extended uptime (approximately 60+ days), systems can exhibit NVLink task scheduling behavior that leads to GPU driver hangs and, with the R580 driver or later, to XID errors 150 and 154 in the kernel log. This release fixes the overflow handling in the GPU’s NVLink management microcode within the VBIOS firmware.

  2. GFM does not handle partition state changed to Error from SM.

    Resolved an issue where the Global Fabric Manager (GFM) did not properly propagate partition error states from the Subnet Manager to the NMX-C GetPartitionInfoList() API. Unhealthy partition states are now correctly reported.

  3. NETIR dumps missing, libnvidia-ml.so error in NV-Bug-Report.

    Fixed an issue where the mst gpu add command failed because of broken symbolic links, resulting in missing NETIR dumps.

  4. BMC RF server does not close inactive TCP sessions, resulting in loss of RF access.

    Resolved an issue where idle HTTPS sessions accumulated on the server without being closed promptly, resulting in an increased number of established but unused connections.

  5. GPU PMU and thermal issues caused by driver-VBIOS race condition

    Resolved a race condition between the driver and VBIOS that caused communication failures, leading to GPU PMU halted errors and thermal issues. The VBIOS update adds locking between response paths to prevent this condition.
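On systems that have not yet taken the fix for item 1, it can help to check how long a host has been up and whether XID 150 or 154 has already appeared in the kernel log. The following is a minimal sketch, not an NVIDIA tool: the function names are hypothetical, and the filter assumes the usual "NVRM: Xid (PCI:...): <code>," layout of driver messages.

```shell
# Sketch only: helper names are illustrative, not part of any NVIDIA package.

# Print whole days of uptime, read from a /proc/uptime-style file
# (first field is uptime in seconds). Defaults to /proc/uptime.
uptime_days() {
    awk '{ printf "%d\n", $1 / 86400 }' "${1:-/proc/uptime}"
}

# Filter kernel-log text on stdin and keep only XID 150 or 154 lines.
# Assumes the common "NVRM: Xid (PCI:0000:xx:00): <code>, ..." format.
# Returns success even when nothing matches, so it is safe in pipelines.
xid_150_154() {
    grep -E 'NVRM: Xid \([^)]*\): (150|154),' || true
}

# Example use (commented out so that sourcing this file has no side effects):
#   if [ "$(uptime_days)" -ge 60 ]; then
#       dmesg | xid_150_154
#   fi
```

The 60-day threshold in the commented example mirrors the "approximately 60+ days" figure in item 1; adjust it to match local maintenance policy.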

Release 1.3.2

  1. NVLink recovery is enabled by default in this release (NVLink improvement).

  2. Duplicate NVLink plane ID leading to Isolated Xid 145 or NCCL all_reduce_perf failures with “uncorrectable NVLink error”.

    Workaround: In earlier versions, use the following steps:

    1. Stop the NMX-C service: nv action stop cluster app nmx-controller

    2. Clear the Session Manager cache by removing files from /var/log/nmx/nmx-c/nvlsm: rm -rf /var/log/nmx/nmx-c/nvlsm/*

    3. Restart the NMX-C service: nv action start cluster app nmx-controller

  3. Intermittent CPU/GPU throttling and degraded performance.

    This release includes filtering to prevent false thermal triggers, eliminating throttling and maintaining expected performance levels.

  4. LPDDR MultiBit ECC not in CMET.

    Fixed an issue where disabled memory channels could become re-enabled during CMET table updates.

  5. Incorrect / out of range fan speed reported from NVSwitch.

    Resolved an NVSwitch fan speed reporting issue caused by repeated CPLD inventory reads via the BMC interface.

  6. NVSwitch SSD was 100% used for syslog.

    Resolved an issue where syslog could fill the /var/log partition on GB200, causing 100% disk usage and system health alarms.

  7. NVSwitch BMC $expand >2 returns 404 on blacklisted Ports.

    Improved Redfish $expand handling for levels >2 to gracefully skip or ignore responses from blacklisted/unsupported endpoints (for example, Switch Ports), eliminating spurious 404 Not Found errors while preserving complete data for supported resources.

  8. Improved NVLink reliability with updated NVLink Reduction (NVR) and accounting for HLL (Head-of-queue Lifetime Limit) when NVLink Recovery is enabled.

  9. Switch detected as unhealthy but the partition was never marked as unhealthy.

    This fix identifies the stale GPU health state after compute tray removal and marks the GPU as GPU_HEALTH_NO_NVLINK.

  10. Fixed intermittent compute node hangs during heavy multicast object operations.

    Fixed a regression that caused intermittent compute node hangs during heavy multicast object churn, ensuring multicast-intensive workloads complete reliably under full cluster load.
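For item 2 in this release (duplicate NVLink plane ID), the workaround steps for earlier versions can be collected into one small shell function. This is a sketch under stated assumptions: the three commands are the ones listed in the workaround, while the wrapper names and the DRY_RUN switch are illustrative additions. It defaults to printing the commands rather than running them.

```shell
# Sketch only: run() and clear_nvlsm_cache() are hypothetical helper names.
# The commands themselves come from the workaround steps in item 2.

NVLSM_CACHE_DIR=/var/log/nmx/nmx-c/nvlsm

# Execute a command, or just print it when DRY_RUN=1 (the default).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

clear_nvlsm_cache() {
    # 1. Stop the NMX-C service.
    run nv action stop cluster app nmx-controller || return 1
    # 2. Remove the cached files. ":?" aborts if the variable is empty,
    #    so the glob can never expand to "/*".
    run rm -rf "${NVLSM_CACHE_DIR:?}"/* || return 1
    # 3. Restart the NMX-C service.
    run nv action start cluster app nmx-controller
}
```

Calling clear_nvlsm_cache as-is prints the three commands for review; setting DRY_RUN=0 first executes them. Stopping at the first failure avoids restarting the service over a cache that was only partly cleared.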