Improvements

This section provides information about the improvements in each release.

Release 1.0.6

  1. MSE Uptime crash ~67 days (XID 150)

    After extended uptime (approximately 60+ days), systems can exhibit NVLink task scheduling behavior that leads to GPU driver hangs and, when using the R580 driver or later, to XID errors 150 and 154 in the kernel log. The overflow handling in the GPU’s NVLink management microcode within the vBIOS firmware was fixed to avoid this issue.

  2. Intermittent CX8 to GPU link instability causing CTO and other PCIe errors.

    Stabilized the CX8 and GPU Gen6 link operation with the application of improved PHY settings for receiver gain and PLL stability.

  3. GPU PM L1 stability improvement.

    Improved the GPU implementation of PCI-PM L1 flow through setting changes that now ensure continuous PLL stability.

  4. Tray swaps no longer leave stale GPU state with the system manager, so newly added GPUs are correctly detected and appear in the partition.

  5. Resolved an issue where, after a tray swap, the partition showed only 72 GPUs and nv_cli showed 0.0.0.0.

    Fixed issues around identifying GPUs belonging to a Compute Node.

  6. Fixed switch configuration on NVLink systems that incorrectly allowed short transient congestion conditions to cause spurious timeout events and unnecessary port disables (XID149.33).

  7. Elevated rate of GPU PMU halted and thermal issues reported.

    Resolved a race condition between the driver and VBIOS that could cause communication failures, resulting in elevated rates of GPU PMU halted and thermal issues. The VBIOS update implements locking between response paths to prevent this condition.

  8. Improved NVLink reliability with updated NVLink reduction and accounting for HLL.

    This fix prevents unnecessary analysis of false positives (where an error is indicated but no actual fault exists), reducing wasted effort and improving diagnostic accuracy.

  9. Enable DHCP6 on eth0.

    This fix ensures that DHCP6 is enabled and running.
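For hosts that have not yet received the firmware fix from item 1, the following is a minimal sketch of how an operator might flag systems that are approaching the affected uptime window or have already logged XID 150 or 154. It is an illustration only, not an NVIDIA tool. The 60-day threshold comes from item 1, the `/proc/uptime` input is standard Linux, and the `NVRM: Xid (...)` line format is an assumption based on commonly seen driver log output. All function names are hypothetical.

```python
import re

# Threshold taken from the "approximately 60+ days" figure in item 1.
UPTIME_WARN_DAYS = 60

# Assumed kernel log format: "NVRM: Xid (PCI:0000:01:00): 150, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \(([^)]*)\): (\d+)")


def uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime (first field is seconds) to days."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400.0


def find_xids(kernel_log, codes=(150, 154)):
    """Return (device, xid) pairs for log lines that report one of `codes`."""
    hits = []
    for match in XID_PATTERN.finditer(kernel_log):
        device, xid = match.group(1), int(match.group(2))
        if xid in codes:
            hits.append((device, xid))
    return hits


def needs_attention(proc_uptime_text, kernel_log):
    """True if uptime is at or past the threshold, or a listed XID was logged."""
    long_uptime = uptime_days(proc_uptime_text) >= UPTIME_WARN_DAYS
    return long_uptime or bool(find_xids(kernel_log))
```

In practice, `proc_uptime_text` could be read from `/proc/uptime` and `kernel_log` from the output of `dmesg`. Both are passed in as strings here so the logic can be exercised without access to a live system.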

Release 1.0.1

  1. Fixed an issue preventing modification of the Rsyslog TransportProtocol in the NVIDIA Switch BMC from default UDP to TCP or other protocols. Transport protocol configuration is now supported.

  2. Resolved an authentication issue that led to a DoS-like state, blocking BMC access and affecting firmware updates and other functionality.

  3. The method for identifying and managing GPUs that belong to the same compute tray has been updated.

  4. Resolved an IMEX issue where outgoing gRPC connections were not reliably detected as lost. Improved event handling and connection recovery now enable nvidia-imex to promptly reestablish communication and resume processing when a disconnect occurs.

  5. Excessive partition API requests can fill GFM logs.

    Logging has been optimized to capture entries only during actual topology changes, ensuring the preservation of critical information.

  6. eth0/1 routing was not separated, causing incorrect routing when eth0 was disconnected.

    Added a fix to separate the routing per interface (eth0/1).

  7. Resolved an intermittent issue where a deadlock in GFM would result in the CUDA error “All CUDA-capable devices are busy or unavailable”.

  8. Resolved an occasional failure with the NVLS bind operation (cuMulticastBindMem).

  9. FLR reset hangs during GPU engine transaction.

    Fixed by leveraging the hardware engine reset state machine to reliably reset the engine.

  10. Fixed multicast reference count computation.

    Fixed an issue that caused multicast reference count to be computed incorrectly.

  11. In rare instances, the PKEY request could be rejected by the Subnet Manager.

    Removed the race condition. The PKEY database is now always cleared before the response is sent to the GFM.

  12. Partitions are deleted from the domain.

    This update addresses the root cause of the GFM crash, ensuring stability and preventing recurrence of the issue.

  13. Allow GPUs that are part of a multicast group to be removed from the partition.

    Added a fix to allow GPUs that are part of a multicast group to be removed from a partition. The removed GPU is set to a “reset required” state and can’t be used to run workloads until the GPU is reset.

  14. Resolved intermittent C2C link training failure during GPU reset.

    During GPU reset, the Grace<->Blackwell GPU C2C link training process intermittently fails, leading to unrecoverable host errors and, in some instances, Grace firmware crashes causing the host to reboot. This issue has been resolved in the updated GPU firmware, which ensures reliable C2C link training during reset.

  15. Resolved an issue where the SGPI_D0 and SGPI_E0 status did not reflect MCIO J80.B12.

    Added a fix to report the second presence pin as well, rather than relying on just one of the pins.

  16. This release includes PCIe link quality improvements that resolve an occasional PCIe Completion Timeout (CTO) error that could result in a kernel crash.

  17. Aligned the behavior of the NMX-C API for duplicate partition creation, which previously returned RESOURCE_USED on GB300 but PARTITION_EXISTS on GB200.

    With v1.0.1, GB300 will return NMX_ST_PARTITION_EXISTS (same as GB200).

  18. Updated NMX-C to return partitionId in UpdatePartitionResponse for RemoveGpusFromPartition (“User partition”) and AddGpusToPartition (“User partition”).

    Earlier versions failed to return partitionId.

  19. Resolved an incorrect GPU THERM_WARN_INT error message in SEL logs that appeared while the GPU was still operating within the normal temperature range.

    Such an error was occasionally produced during stress test workloads. This was due to an incorrect temperature threshold in some of the code paths.

  20. Resolved an intermittent issue where NCCL all-reduce throughput drops. This occurs only when NVLink SHARP is enabled and control messages from the GPU to GFM are dropped (for example, due to buffer contention), resulting in multicast setup failure. This version adds fixes across the GPU driver, Fabric Manager, and GPU firmware to improve the reliability of NVLS initialization.
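Client code that creates partitions against a mix of firmware versions may need to recognize both duplicate-creation results described in item 17. The following is a minimal sketch of such a check, not part of the NMX-C API. The status names PARTITION_EXISTS, RESOURCE_USED, and NMX_ST_PARTITION_EXISTS come from item 17. The spelling NMX_ST_RESOURCE_USED, the string representation of status codes, and the function and parameter names are assumptions made for illustration.

```python
# Status names from item 17; NMX_ST_RESOURCE_USED is an assumed spelling.
EXISTS_STATUSES = ("NMX_ST_PARTITION_EXISTS", "PARTITION_EXISTS")
LEGACY_GB300_STATUSES = ("NMX_ST_RESOURCE_USED", "RESOURCE_USED")


def is_duplicate_partition(status, legacy_gb300=False):
    """Report whether a create-partition status means "partition already exists".

    `legacy_gb300` (hypothetical flag) should be True only when talking to a
    GB300 system running a release earlier than 1.0.1, where a duplicate
    creation was reported as RESOURCE_USED instead.
    """
    name = status.strip().upper()
    if name in EXISTS_STATUSES:
        return True
    # Only treat RESOURCE_USED as a duplicate on the older GB300 behavior,
    # so that it is not silently reinterpreted on aligned releases.
    return bool(legacy_gb300) and name in LEGACY_GB300_STATUSES
```

Keeping the legacy mapping behind an explicit flag is a deliberate choice in this sketch: once all systems run v1.0.1 or later, the flag can be dropped and only NMX_ST_PARTITION_EXISTS needs to be handled.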