SNAP-4 Service Release Notes
These release notes cover NVIDIA® BlueField®-3 SNAP software: changes and new features, known issues, and bug fixes.
Key features in NVIDIA SNAP 4.5.4:
- Bug fixes
Ref # | Issue |
4403289 | Description: IO timeouts may occur when repeatedly restarting SNAP service while running traffic from the host side. |
Keywords: Timeout | |
Discovered in version: 4.5.3 | |
4491957 | Description: Under rare circumstances, SNAP software cannot recover its NVMe controller after a crash. |
Keywords: Crash | |
Discovered in version: 4.5.3 | |
4457230 | Description: When performing consecutive suspend and resume RPC operations under stress, the host system may hang. |
Keywords: Stress test | |
Discovered in version: 4.5.3 | |
4487171 | Description: If the host OS is rebooted during a SNAP live update, the live update fails, causing the SNAP device to malfunction. |
Keywords: Reboot; live update; hang | |
Discovered in version: 4.5.3 | |
4436211 | Description: When using virtio-blk controller with the --indirect_desc flag under intense pressure, IO may hang. |
Keywords: Hang | |
Discovered in version: 4.5.3 |
SNAP Issues
The following are known limitations of this NVMe/virtio-blk SNAP software version.
Ref # | Issue |
4524204 | Description: Due to a new force flag in the controller_destroy RPC, the live update tool fails to update from U2 release LTS to U3 release LTS. |
Workaround: Remove the force flag in destroy_controller in the live update script. | |
Keywords: Live update | |
Discovered in version: 4.4.1 | |
3631346 | Description: When using dynamic MSIX with NVMe protocol, the free_queues PF controller property (as described in section "SR-IOV Dynamic MSIX Management") is not valid and always shows 0. |
Workaround: Ignore the value and assume the | |
Keywords: NVMe; MSIX | |
Discovered in version: 4.4.1 | |
- | Description: The SPDK |
Workaround: N/A | |
Keywords: NVMe | |
Discovered in version: 4.4.0 | |
3817040 | Description: When running nvme_controller_suspend RPC with the --timeout parameter, if the timeout expires, the device is no longer operational and cannot be resumed. |
Workaround: Destroy and re-create the controller. | |
Keywords: NVMe | |
Discovered in version: 4.4.0 | |
3809646 | Description: With a new DPA provider, when a DMA is followed by an interrupt to the DPA, the DPA wakes up before the DMA is written to the buffer, causing the DPA to miss events. |
Workaround: Add a software-based periodic wake-up mechanism. | |
Keywords: NVMe | |
Discovered in version: 4.4.0 | |
3773346 | Description: In virtio-blk controller configuration, when running with SPDK NVMe-oF initiator as a backend, an unaligned size_max value may cause memory corruption. |
Workaround: size_max and seg_max values must be a power of 2. | |
Keywords: Virtio-blk; NVMe-oF; SPDK | |
Discovered in version: 4.3.1 | |
3745842 | Description: When running with NVMe/TCP SPDK block device as a backend, SNAP cannot work over more than 8 cores. |
Workaround: Use an Arm core mask that selects only 8 cores. | |
Keywords: NVMe; TCP; SPDK | |
Discovered in version: 4.3.1 | |
- | Description: The container image may become corrupted, resulting in the container status showing as exited with the error message /usr/bin/supervisord: exec format error. |
Workaround: Remove the YAML file from kubelet, use crictl images to list the images, and use crictl rmi <image-id> to remove the image. Run systemctl restart containerd and systemctl restart kubelet, then copy the YAML file to kubelet again. | |
Keywords: NGC; container image | |
Discovered in version: 4.3.1 | |
3757171 | Description: When running virtio-blk emulation with large IOs (>128K) and SPDK's nvmf initiator as a backend, IOs may fail in SPDK layer due to bad alignment. |
Workaround: size_max value of virtio_blk_controller_create RPC must be set and be a power of 2. | |
Keywords: SPDK, virtio-blk, size_max | |
Discovered in version: 4.3.1 | |
3689918 3753637 | Description: SNAP container bring-up takes a long time when configured with a large number of emulations, possibly taking longer than the default NVMe driver timeout. |
Workaround: Increase NVMe driver IO timeout to 300 seconds (instead of 30). | |
Keywords: NVMe; recovery; kernel driver | |
Discovered in version: 4.3.0 | |
- | Description: NVMeTCP XLIO is currently not supported when running 64K page size kernels on the DPU Arm cores (as is the case for CentOS 8.x, Rocky 8.x, or openEuler 20.x). |
Workaround: N/A | |
Keywords: 64K page size; NVMeTCP XLIO | |
Discovered in version: 4.1.0 | |
3264154 | Description: NVMeTCP XLIO is not supported when running 64K page size kernels on the DPU Arm cores (as is the case with CentOS 8.x, Rocky 8.x, or openEuler 20.x). |
Workaround: N/A | |
Keywords: Page size; NVMeTCP XLIO | |
Discovered in version: 4.1.0 | |
- | Description: NVMe over RDMA full offload is not supported. |
Workaround: N/A | |
Keywords: NVMe over RDMA; support | |
Discovered in version: 4.0.0 | |
4110943 | Description: Hot-unplugging a hotplugged Virtio BLK device is not allowed unless a Virtio BLK controller has previously been created for the device. |
Workaround: Create a Virtio BLK controller on the device before hot-unplugging it. | |
Keywords: hotplug, hotunplug | |
Discovered in version: 4.5.0 | |
3985953 | Description: NVMe devices cannot be scaled to their maximum level when PCI functions of other types (virtio-blk, virtio-fs, virtio-net, etc.) are scaled in parallel. |
Workaround: N/A | |
Keywords: SRIOV, NVMe | |
Discovered in version: 4.5.0 |
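Several workarounds in the table above reduce to simple numeric checks: size_max and seg_max must be powers of 2 (3773346, 3757171), and the Arm core mask may select only 8 cores when an NVMe/TCP SPDK block device is the backend (3745842). The Python sketch below shows one way such checks might be scripted before creating controllers. All function names are hypothetical helpers, not part of SNAP or SPDK, and the mask is assumed to follow the common convention where bit N selects core N.

```python
from typing import List

# Hypothetical pre-flight helpers; none of these names exist in SNAP or SPDK.


def is_power_of_two(value: int) -> bool:
    """Return True if value is a positive power of 2 (1, 2, 4, 8, ...)."""
    return value > 0 and (value & (value - 1)) == 0


def check_virtio_blk_sizes(size_max: int, seg_max: int) -> List[str]:
    """Return a list of problems found in the size_max and seg_max values."""
    problems = []
    for name, value in (("size_max", size_max), ("seg_max", seg_max)):
        if not is_power_of_two(value):
            problems.append(f"{name}={value} is not a power of 2")
    return problems


def core_count(core_mask: int) -> int:
    """Count selected cores, assuming bit N of the mask selects core N."""
    return bin(core_mask).count("1")


def check_core_mask(core_mask: int, max_cores: int = 8) -> List[str]:
    """Return a problem entry if the mask selects more than max_cores cores."""
    used = core_count(core_mask)
    if used > max_cores:
        return [f"core mask {hex(core_mask)} selects {used} cores; limit is {max_cores}"]
    return []


if __name__ == "__main__":
    # 4096 is a power of 2, 3 is not, so only seg_max is reported.
    print(check_virtio_blk_sizes(size_max=4096, seg_max=3))
    # 0xff selects cores 0-7, which stays within the 8-core limit.
    print(check_core_mask(0xFF))
```

A script like this only validates values before they are passed to the RPCs; it does not replace the workarounds themselves.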
OS/vendor Issues
The following are not BlueField SNAP limitations.
Ref # | Issue |
- | Description: Some old Windows OS NVMe drivers use SGL support incorrectly. |
Workaround: Disable SGL support when using Windows OS by setting the | |
Keywords: Windows; NVMe | |
Reported in version: 4.4.0 | |
2879262 | Description: When the virtio-blk kernel driver cannot find enough MSI-X vectors to satisfy all its opened virtqueues, it falls back to assigning a single MSI-X vector to all virtqueues, which negatively impacts performance. In addition, when a large number (e.g., 64) of virtqueues are associated with a single MSI-X vector, the kernel may enter a soft lockup (kernel bug) and the IO will hang. |
Workaround: Always keep | |
Keywords: Virtio-blk; kernel driver; MSI-X | |
Reported in version: 4.3.0 | |
- | Description: If PCIe devices are inserted before the hot-plug driver is loaded on the host, the hot-plug driver in kernel versions earlier than 4.19 does not enable the slot even if the slot is occupied (i.e., presence detected in the slot status register). That is, firmware changes only the presence state of the slot, but the kernel does not enable the PCIe slot after host bootup. As a result, the PCIe device does not appear in lspci on the host side, and the BDF is 0 on the controller. |
Workaround: Add | |
Keywords: Virtio-blk; kernel driver; hot-plug | |
Reported in version: 4.2.1 | |
- | Description: RedHat/CentOS 7.x does not handle "online" (post driver probe) namespace additions/removals correctly. |
Workaround: Use | |
Keywords: NVMe; CentOS; RedHat; kernel | |
Reported in version: 4.1.0 | |
- | Description: Some Windows drivers have experimental support for "online" (post driver probe) namespace additions/removal, although such support is not communicated with the device. |
Workaround: Use | |
Keywords: NVMe; Windows | |
Reported in version: 4.1.0 | |
- | Description: VMware ESXi supports "online" (post driver probe) namespace additions/removal only if “Namespace Management” is supported by the controller. |
Workaround: Use | |
Keywords: NVMe, ESXi | |
Reported in version: 4.1.0 | |
- | Description: Ubuntu 22.04 does not support 500 VFs. |
Workaround: N/A | |
Keywords: Virtio-blk; kernel driver; Ubuntu 22.04 | |
Reported in version: 4.1.0 | |
- | Description: Virtio-blk Linux kernel driver does not handle PCIe FLR events. |
Workaround: N/A | |
Keywords: Virtio-blk; kernel driver | |
Reported in version: 4.0.0 | |
3679373 | Description: Virtio-blk SPDK driver (vfio-pci based) does not handle PCIe FLR events. |
Workaround: N/A | |
Keywords: Virtio-blk; SPDK driver | |
Reported in version: 4.3.0 | |
- | Description: A new virtio-blk Linux kernel driver (starting with kernel 4.18) does not support hot-unplug during traffic. Since the kernel may self-generate spontaneous IOs, on rare occasions an issue may occur even when no traffic is explicitly being run. |
Workaround: N/A | |
Keywords: Virtio-blk; kernel driver | |
Reported in version: 4.0.0 | |
- | Description: SPDK NVMf/RDMA initiator fails to connect to kernel NVMf/RDMA remote target. |
Workaround: Use setting | |
Keywords: SPDK, NVMf, RDMA, kernel | |
Reported in version: 4.3.1 | |
- | Description: Windows OS virtio-blk driver expects at least 64K data to be available for a single IO request. |
Workaround: Use | |
Keywords: Windows, virtio-blk | |
Reported in version: 4.3.1 | |
- | Description: Some old Windows OS versions have a malfunctioning inbox virtio-blk driver and require a third-party virtio-blk driver to be pre-installed to operate properly. |
Workaround: Use the verified third-party driver published by Fedora (link). | |
Keywords: Windows, virtio-blk | |
Reported in version: 4.3.1 | |
4184299 | Description: When using hotplugged PCIe devices, the host OS cannot detect PCIe changes at runtime; it detects them only during the boot stage. |
Workaround: In a hotplug scenario, reboot the OS after all desired new devices are plugged. In a hot-unplug scenario, reboot the OS after each existing device is removed (some Windows versions reboot automatically). | |
Keywords: Hotplug, Windows. | |
Reported in version: 4.5.0 | |
3748674 | Description: On most modern Linux distributions, unplugging a PCIe function from the host while IOs are still in flight on it causes the virtio-blk driver to hang. |
Workaround: N/A | |
Keywords: Hotplug, Linux. | |
Reported in version: 4.5.0 |
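Two entries in the table above depend on the host Linux kernel version: the hot-plug driver issue affects kernels earlier than 4.19, and the virtio-blk hot-unplug limitation starts with kernel 4.18. The Python sketch below shows one way a host-side script might flag which of these notes could apply to a given kernel release string. The function names are hypothetical, and distribution kernels often backport changes, so the upstream version number is only a rough indicator.

```python
import re
from typing import List, Tuple

# Hypothetical helpers for illustration; not part of SNAP or any NVIDIA tool.


def parse_kernel_version(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a release string like '4.18.0-305.el8.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def applicable_kernel_notes(release: str) -> List[str]:
    """Return the version-dependent notes from the table that may apply."""
    version = parse_kernel_version(release)
    notes = []
    if version < (4, 19):
        # Hot-plug driver may not enable a slot occupied before the driver loaded.
        notes.append("hot-plug: slot may not be enabled for pre-inserted devices")
    if version >= (4, 18):
        # Newer virtio-blk driver does not support hot-unplug during traffic.
        notes.append("virtio-blk: hot-unplug during traffic is not supported")
    return notes


if __name__ == "__main__":
    # Kernel 4.18 falls in both ranges, so both notes are listed.
    for note in applicable_kernel_notes("4.18.0-305.el8.x86_64"):
        print(note)
```

On a live Linux host, the release string could be taken from platform.release() in the standard library instead of the hard-coded example.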