DOCA Documentation v3.1.0

SNAP-4 Service Release Notes

This page describes new features, known issues, and bug fixes for NVIDIA® BlueField®-3 SNAP software.

User Experience Updates:

  • Added a new firmware configuration parameter, NVME_EMULATION_MAX_QUEUE_DEPTH, which must be set during firmware configuration (see the example after this list).
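
A minimal sketch of setting this parameter with mlxconfig; the MST device path is an example and depends on your setup (query yours with mst status), and the value shown (12) matches the default mentioned in the Windows MQES note later in this page. A firmware reset or power cycle is required for the change to take effect.

    # set the NVMe emulation maximum queue depth in firmware configuration
    mlxconfig -d /dev/mst/mt41692_pciconf0 set NVME_EMULATION_MAX_QUEUE_DEPTH=12
    # verify the configured value
    mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep NVME_EMULATION_MAX_QUEUE_DEPTH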

New Features:

  • Enhanced live update tool to support hit-less updates from SNAP source to SNAP container.

  • Improved SNAP IO latency for virtio-blk.

  • Added support for Ubuntu 24.04 with the BlueField BF-Bundle.

  • Implemented a new mechanism to track NVMe driver bind/unbind events on the host.

  • Added the nvme_controller_dbg_debug_stats_get RPC for querying the state of NVMe submission queues (SQs) and completion queues (CQs); see the example after this list.

  • Introduced beta-level VFIO live migration support for virtio-blk.
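
A hedged sketch of querying SQ/CQ debug state through the new RPC; the -c NVMeCtrl1 controller-name argument is an assumption for illustration and should be replaced with the name of an existing NVMe controller in your deployment.

    # query SQ/CQ debug statistics for an existing NVMe controller
    # (controller name and -c flag are assumptions, not documented here)
    snap_rpc.py nvme_controller_dbg_debug_stats_get -c NVMeCtrl1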

Bug Fixes:

  • Various bug fixes to enhance system stability.

Compatibility:

  • SNAP July container 4.8.0 is compatible with April GA firmware.

  • Enhanced live update tool to enable hit-less updates from SNAP 4.7.0 (April) to SNAP 4.8.0 (July).

Ref #

Issue

4436211

Description: I/O operations from the host side may hang when using the vblk emulation device with indirect mode.

Keywords: indirect mode

Discovered in version: 4.5.3

4503883

Description: Fixed a problem where controller destruction would stall in the BDEV reset.

Keywords: Tear-down stall

Discovered in version: 4.5.3

4507936

Description: Corrected the default MQES value for hotplug devices.

Keywords: MQES

Discovered in version: 4.5.3

4508111

Description: Addressed an issue where multiple disks would become invisible after performing a live update during a host reboot.

Keywords: live update; host reboot

Discovered in version: 4.5.3

4510772

Description: Fixed a kernel panic on the host that could occur following SNAP NVMe recovery procedures.

Keywords: live update; host reboot

Discovered in version: 4.5.3

SNAP Issues

Ref #

Issue

4499506

Description: Live update fails when multiple remote bdevs are configured, due to a QP creation timeout.

Workaround: N/A

Keywords: Live update; remote bdev

Discovered in version: 4.8.0

4561599

Description: The NVMe UEFI expansion ROM driver requires at least 2 IO queues to be supported by the SNAP controller.

Workaround: Use --num_queues=2 (or higher) with the nvme_controller_create RPC for NVMe boot devices (see the sketch after this entry).

Keywords: NVMe; UEFI boot

Discovered in version: 4.8.0
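
A hedged sketch of the workaround; the ellipsis stands for the deployment-specific arguments normally passed to nvme_controller_create, which are omitted here.

    # create the NVMe boot controller with at least 2 IO queues
    # so the UEFI expansion ROM driver can enumerate it
    snap_rpc.py nvme_controller_create ... --num_queues=2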

4499506

4502937

Description: When working at high scale (>=16) of virtio-blk or NVMe functions backed by SPDK's NVMe-oF/RDMA bdevs, the live update process may fail.

Workaround: Use "regular" update process (close + reopen SNAP application).

Keywords: NVMe; virtio-blk; live update; SPDK; NVMe-oF/RDMA

Reported in version: 4.8.0

4513104

Description: If the virtio-blk driver in the VM is unprobed or there is an FLR on the VM function during virtio-blk VFIO live migration, the driver may get stuck.

Workaround: N/A

Keywords: virtio-blk; live migration; FLR; reset; driver unprobe

Reported in version: 4.8.0

-

Description: The Linux kernel virtio-blk driver expects --size_max to be at least the OS page size. If it is not, a driver warning is printed on the host side.

Workaround: Use the --size_max option with a value equal to the OS page size (typically 4096); see the sketch after this entry.

Keywords: virtio_blk; linux; kernel; size_max

Reported in version: 4.7.0
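
A hedged sketch of the workaround, assuming a 4 KiB OS page size; the other virtio_blk_controller_create arguments are omitted.

    # keep size_max at least as large as the host OS page size
    snap_rpc.py virtio_blk_controller_create ... --size_max 4096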

4396707

Description: SPDK's virtio-blk driver uses 4K segment sizes, even though size_max technically allows for larger segments. This might cause large logical block size IO handling to fail.

Workaround: When using the SPDK virtio-blk driver, always set the --size_max option with a value of 4096.

Keywords: virtio_blk; size_max; SPDK

Reported in version: 4.7.0

4409344

Description: When performing a live update too quickly (e.g., using an automated script), the destination process might not yet have created all necessary resources when prompted to handshake with the source process.

Workaround: Add a sleep(3) between the creation of the controller on the destination process (with the --live_update_listener flag) and the succeeding request to suspend the source controller (with the --live_update_notifier flag); see the sketch after this entry. This increases the overall automated script runtime but should not affect the host driver's downtime.

Keywords: virtio_blk; nvme; live update

Reported in version: 4.7.0
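
A hedged sketch of pacing an automated live update script, shown for NVMe. Only the --live_update_listener and --live_update_notifier flags are named in the workaround; the controller name, the -c argument, and the remaining nvme_controller_create arguments are assumptions for illustration.

    # destination process: create the controller that listens for live update state
    snap_rpc.py nvme_controller_create ... --live_update_listener
    # give the destination time to finish creating its resources
    # before the handshake with the source process
    sleep 3
    # source process: suspend the source controller and notify the destination
    snap_rpc.py nvme_controller_suspend -c NVMeCtrl1 --live_update_notifier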

4412341

Description: When using a high scale (>=512) of virtio-blk VFs on a single PF, a sudden hypervisor crash (or ungraceful warm reboot) may result in a hypervisor hang due to the long FLR processing time.

Workaround: Split the opened VFs among more PFs; gracefully shut down VMs before performing a hard OS reset.

Keywords: virtio-blk; FLR; SR-IOV

Reported in version: 4.7.0

4104709

Description: Some legacy operating systems (e.g., RockyLinux with kernel 4.18) issue virtio-blk zero-length I/Os during boot (e.g., during EDD probing).

Workaround: Set the VIRTIO_BLK_ZERO_LEN_IO_FAIL=1 environment variable to configure SNAP to fail zero-length I/Os in virtio-blk (see the sketch after this entry).

Keywords: virtio-blk; zero-length I/Os

Reported in version: 4.6.0
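
A hedged sketch of setting the variable for a SNAP application launched from a shell; when running the SNAP container, the equivalent is adding the variable to the container YAML environment (exact file locations depend on your deployment).

    # configure SNAP to fail virtio-blk zero-length I/Os
    export VIRTIO_BLK_ZERO_LEN_IO_FAIL=1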

-

Description: When using both NVMe and Virtio-blk protocols in SNAP, their data providers may share the same DPA HARTS, potentially causing NVMe configuration or performance issues. This is especially relevant when Virtio-blk is in DPU mode and NVMe is in DPA mode.

Workaround: Use the dpa_helper_core_mask and dpa_nvme_core_mask environment variables as DPA core masks, for example dpa_helper_core_mask=0x0000FFFF and dpa_nvme_core_mask=0xFFFF0000 (see the sketch after this entry).

Keywords: Virtio-blk and NVMe; DPA core mask.

Reported in version: 4.6.0
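
A hedged sketch using the example masks from the workaround; how the variables are passed (shell export for a source build vs. the container YAML environment) depends on your deployment.

    # keep the virtio-blk DPA helper and the NVMe DPA provider on disjoint DPA cores
    export dpa_helper_core_mask=0x0000FFFF
    export dpa_nvme_core_mask=0xFFFF0000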

4110943

Description: Hot-unplugging a hotplugged Virtio BLK device is not allowed unless a Virtio BLK controller has previously been created for the device.

Workaround: Create a Virtio BLK controller on the device that needs to be hot-unplugged.

Keywords: hotplug; hotunplug

Discovered in version: 4.5.0

3631346

Description: When using dynamic MSIX with NVMe protocol, the free_queues PF controller property is not valid and always shows 0.

Workaround: Ignore the value and assume the free_queues pool is large enough.

Keywords: NVMe; MSIX

Discovered in version: 4.4.1

-

Description: The SPDK bdev_uring is not supported.

Workaround: N/A

Keywords: NVMe

Discovered in version: 4.4.0

3817040

Description: When running nvme_controller_suspend RPC with the --timeout parameter, the device is no longer operational and cannot be resumed if the timeout expires.

Workaround: Destroy and re-create the controller.

Keywords: NVMe

Discovered in version: 4.4.0

3809646

Description: When working with a new DPA provider, if an interrupt is sent to DPA immediately after a DMA operation, DPA may wake up before the DMA is fully written to the buffer, causing it to miss events.

Workaround: Add a software-based periodic wake-up mechanism.

Keywords: NVMe

Discovered in version: 4.4.0

3773346

Description: When configuring the virtio-blk controller, using an unaligned size_max value with the SPDK NVMe-oF initiator as the backend can lead to memory corruption.

Workaround: size_max and seg_max values must be a power of 2.

Keywords: Virtio-blk; NVMe-oF; spdk

Discovered in version: 4.3.1

3745842

Description: When using the NVMe/TCP SPDK block device as a backend, SNAP is limited to working with no more than 8 cores.

Workaround: Work with an Arm core mask that uses only 8 cores.

Keywords: NVMe; TCP; SPDK

Discovered in version: 4.3.1

-

Description: Container images may become corrupted, resulting in a container status of exited with the error message /usr/bin/supervisord: exec format error.

Workaround: Remove the YAML from kubelet, use crictl images to list the images and crictl rmi <image-id> to remove them. Run systemctl restart containerd and systemctl restart kubelet, then copy the YAML file back to kubelet (see the sketch after this entry).

Keywords: NGC; container image

Discovered in version: 4.3.1
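
The same recovery sequence laid out as commands; the image ID is a placeholder, and the kubelet manifests path depends on your deployment.

    # remove the SNAP YAML from the kubelet manifests directory first, then:
    crictl images                      # list images and note the corrupted image ID
    crictl rmi <image-id>              # remove the corrupted image
    systemctl restart containerd       # restart the container runtime
    systemctl restart kubelet          # restart kubelet
    # finally, copy the SNAP YAML file back into the kubelet manifests directory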

3757171

Description: When running virtio-blk emulation with large IOs (>128K) and SPDK's NVMe-oF initiator as a backend, IOs might fail in the SPDK layer due to poor alignment.

Workaround: size_max value of virtio_blk_controller_create RPC must be set and be a power of 2.

Keywords: SPDK; virtio-blk; size_max

Discovered in version: 4.3.1

3689918

3753637

Description: The SNAP container takes a long time to start up when configured with a large number of emulations, potentially exceeding the default NVMe driver timeout.

Workaround: Increase the NVMe driver IO timeout from 30 to 300 seconds (see the sketch after this entry).

Keywords: NVMe; recovery; kernel driver

Discovered in version: 4.3.0
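
A hedged sketch of raising the host NVMe driver IO timeout to 300 seconds; the sysfs path applies to the Linux nvme_core module, and the persistent variant assumes a standard modprobe configuration (the file name is an example).

    # runtime change on the host
    echo 300 > /sys/module/nvme_core/parameters/io_timeout
    # persistent change (takes effect after the module or kernel is reloaded)
    echo "options nvme_core io_timeout=300" > /etc/modprobe.d/nvme-timeout.conf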

3264154

Description: NVMeTCP XLIO is not supported when running 64K page size kernels on the DPU Arm cores (as is the case with CentOS 8.x, Rocky 8.x, or openEuler 20.x).

Workaround: N/A

Keywords: Page size; NVMeTCP XLIO

Discovered in version: 4.1.0

-

Description: NVMe over RDMA full offload is not supported.

Workaround: N/A

Keywords: NVMe over RDMA; support

Discovered in version: 4.0.0


OS or Vendor Issues

Ref #

Issue

4408109

Description: When using a Windows OS on the host side, the controller must be created with a --num_queues value greater than or equal to the number of CPU cores on the Windows server. Failing to do so may result in undefined behavior, and the disk might fail to initialize properly.

Workaround: N/A

Keywords: Windows, Windows OS

Discovered in version: 4.8.0

4540848

Description: The Windows NVMe driver ignores the MQES value set in the PCI BAR and forcefully tries to open queues with a size of 256.

Workaround: Set NVME_EMULATION_MAX_QUEUE_DEPTH in mlxconfig to be >= 8 (the default is 12) to support a queue depth (MQES) of at least 256.

Keywords: NVMe, Windows

Discovered in version: 4.8.0

-

Description: Windows OS assumes NVMe devices support at least 2 IO CQs (CQ ID 2 exists), even when the controller declares it only supports 1 IO queue.

Workaround: Open the NVMe controller with a --num_queues value >= 2.

Keywords: Windows, NVMe

Discovered in version: 4.7.0

4418372

Description: On Windows OS, hot-unplugging a virtio-blk PCIe function can cause unexpected behavior, and a host reboot might be necessary to recover. This is because the Windows OS does not support online PCIe rescan, unlike Linux.

Workaround: Before unplugging a PCIe function, disable its storage controller in Device Manager

Keywords: Windows, virtio-blk, hotplug

Discovered in version: 4.7.0

4206444

Description: The Linux kernel driver does not restrict seg_max based on the maximum queue_size reported by the device. This can lead to a WARN_ON_ONCE() trigger in the kernel, resulting in driver misbehavior and potential system hangs.

Workaround: Ensure that all virtio-blk controllers are configured with a seg_max value smaller than (queue_size - 2); see the sketch after this entry.

Keywords: Virtio-blk; kernel driver; seg_max; queue_size

Reported in version: 4.6.0
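
A hedged sketch of the constraint, assuming the controller exposes a virtqueue size of 256; only --seg_max is taken from the workaround text, and the remaining arguments are omitted.

    # with a virtqueue size of 256, keep seg_max below 254 (queue_size - 2)
    snap_rpc.py virtio_blk_controller_create ... --seg_max 253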

-

Description: When using hotplugged PCIe devices, after all devices are plugged, the host must be rebooted for Windows to detect all of them (some Windows versions may perform the reboot automatically). This is required because Windows does not support online PCIe rescan (as Linux does).

Workaround: N/A

Keywords: Hotplug, Windows

Reported in version: 4.5.0

3748674

Description: On most modern Linux distributions, unplugging a PCIe function from the host while there are inflight I/Os can cause the virtio-blk driver to hang.

Workaround: N/A

Keywords: Hotplug, Linux

Reported in version: 4.5.0

-

Description: Some old Windows NVMe drivers have buggy SGL support.

Workaround: Disable SGL support when using Windows by setting --quirks bit 4 to 1 in the snap_rpc.py nvme_controller_create RPC (see the sketch after this entry).

Keywords: Windows; NVMe

Reported in version: 4.4.0
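
A hedged sketch of the workaround; bit 4 of the quirks bitmask corresponds to the value 0x10, and the remaining nvme_controller_create arguments are omitted.

    # set quirks bit 4 (0x10) to disable SGL support for buggy Windows NVMe drivers
    snap_rpc.py nvme_controller_create ... --quirks=0x10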

2879262

Description: When the virtio-blk kernel driver cannot find enough MSI-X vectors to satisfy all of its opened virtqueues, it falls back to assigning a single MSI-X vector to all virtqueues, which negatively impacts performance. In addition, when a large number (e.g., 64) of virtqueues are associated with a single MSI-X vector, the kernel may enter a soft lockup (kernel bug) and IO will hang.

Workaround: Always keep num_queues < num_msix. Do not set --num_queues when creating virtio-blk controllers. The optimal value is automatically chosen based on available MSI-X.

Keywords: Virtio-blk; kernel driver; MSI-X

Reported in version: 4.3.0

-

Description: If PCIe devices are inserted before the hot-plug driver is loaded on the host, the hot-plug driver in kernel versions less than 4.19 does not enable the slot, even if the slot is occupied (i.e., presence detected in the slot status register). This means that only the presence state of the slot is updated by the firmware, but the PCIe slot is not enabled by the kernel after the host boots up.

As a result, the PCIe device will not be visible when using lspci on the host side, and the bus, device, and function (BDF) identifier will show as 0 on the controller.

Workaround: Add pciehp.pciehp_force=1 to the boot command line on the host (see the sketch after this entry).

Keywords: Virtio-blk; kernel driver; hot-plug

Reported in version: 4.2.1
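
A hedged sketch for a GRUB-based host; the file path and the regeneration command vary by distribution.

    # /etc/default/grub on the host: append the parameter to the kernel command line
    GRUB_CMDLINE_LINUX="... pciehp.pciehp_force=1"
    # then regenerate the GRUB configuration and reboot, e.g. on Debian/Ubuntu:
    update-grub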

-

Description: RedHat/CentOS 7.x does not handle "online" (post driver probe) namespace additions or removals correctly.

Workaround: Use --quirks=0x2 option in snap_rpc.py nvme_controller_create.

Keywords: NVMe; CentOS; RedHat; kernel

Reported in version: 4.1.0

-

Description: Some Windows drivers have experimental support for "online" (post driver probe) namespace additions/removals, although such support is not communicated to the device.

Workaround: Use --quirks=0x1 option in snap_rpc.py nvme_controller_create.

Keywords: NVMe; Windows

Reported in version: 4.1.0

-

Description: VMware ESXi supports "online" (post driver probe) namespace additions/removals only if "Namespace Management" is supported by the controller.

Workaround: Use --quirks=0x8 option in snap_rpc.py nvme_controller_create.

Keywords: NVMe, ESXi

Reported in version: 4.1.0

-

Description: Ubuntu 22.04 does not support 500 VFs.

Workaround: N/A

Keywords: Virtio-blk; kernel driver; Ubuntu 22.04

Reported in version: 4.1.0

-

Description: Virtio-blk Linux kernel driver does not handle PCIe FLR events.

Workaround: N/A

Keywords: Virtio-blk; kernel driver

Reported in version: 4.0.0

-

Description: SPDK NVMe-oF/RDMA initiator fails to connect to a kernel NVMe-oF/RDMA remote target.

Workaround: Use spdk_rpc.py bdev_nvme_set_options --io-queue-requests=128 in the SPDK configuration (see the sketch after this entry).

Keywords: SPDK, NVMe-oF, RDMA, kernel

Reported in version: 4.3.1
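
The workaround laid out as a command, using the flag named above; it typically needs to be issued before the SPDK NVMe bdevs are attached so that the option takes effect.

    # limit IO queue requests on the SPDK NVMe-oF/RDMA initiator side
    spdk_rpc.py bdev_nvme_set_options --io-queue-requests=128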

-

Description: The Windows OS virtio-blk driver expects at least 64K of data to be available for a single IO request.

Workaround: Configure the seg_max and size_max parameters to match this requirement (seg_max * size_max > 64K); see the sketch after this entry.

Keywords: Windows, virtio-blk

Reported in version: 4.3.1
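
A hedged sketch satisfying the requirement (32 * 4096 = 128K > 64K); the flag spellings follow the parameter names used in this note, and the remaining virtio_blk_controller_create arguments are omitted.

    # ensure seg_max * size_max exceeds 64K for the Windows virtio-blk driver
    snap_rpc.py virtio_blk_controller_create ... --size_max 4096 --seg_max 32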

-

Description: Some older Windows OS versions have a malfunctioning inbox virtio-blk driver that expects a third-party virtio-blk driver to be pre-installed to operate properly.

Workaround: Use a verified third-party driver from Fedora.

Keywords: Windows, virtio-blk

Reported in version: 4.3.1

3679373

Description: The virtio-blk SPDK driver (vfio-pci based) does not handle PCIe FLR events.

Workaround: N/A

Keywords: Virtio-blk; SPDK driver

Reported in version: 4.3.0

-

Description: A new virtio-blk Linux kernel driver (starting from kernel 4.18) does not support hot-unplug during traffic. Since the kernel may self-generate spontaneous IOs, on rare occasions an issue may arise even when there is no traffic.

Workaround: N/A

Keywords: Virtio-blk; kernel driver

Reported in version: 4.0.0

-

Description: When using SR-IOV with VFs sharing the same driver as their PF (sriov_driver_autoprobe=1), unprobing the driver may take a long time, and some admin commands might time out.

Workaround: N/A

Keywords: SRIOV; driver.

Discovered in version: 3.8.0-8


© Copyright 2025, NVIDIA. Last updated on Sep 4, 2025.