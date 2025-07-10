RPC log history (enabled by default) records all RPC requests (from snap_rpc.py and spdk_rpc.py) sent to the SNAP application, together with the response to each request, in a dedicated log file, /var/log/snap-log/rpc-log. This file is visible outside the container (i.e., the log file's path on the DPU is also /var/log/snap-log/rpc-log).

The SNAP_RPC_LOG_ENABLE environment variable can be used to enable (1) or disable (0) this feature.

Info RPC log history is supported with SPDK version spdk23.01.2-12 and above.

Warning When RPC log history is enabled, the SNAP application continuously appends RPC request and response messages to /var/log/snap-log/rpc-log. Pay attention to the size of this file. If it gets too large, delete the file on the DPU before launching the SNAP pod.
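The size check described in the warning can be scripted. The following is a minimal sketch, not part of SNAP: the helper names and the 100 MiB default threshold are assumptions chosen for illustration, and the script is meant to run on the DPU before the SNAP pod is launched.

```shell
# Hypothetical helper (not part of SNAP): succeeds when FILE exists and is
# larger than LIMIT_BYTES.
rpc_log_too_large() (
    file=$1
    limit_bytes=$2
    [ -f "$file" ] || exit 1
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$limit_bytes" ]
)

# Hypothetical helper: delete the log only when it exceeds the limit.
# The default threshold (100 MiB) is an arbitrary example value.
prune_rpc_log() (
    file=${1:-/var/log/snap-log/rpc-log}
    limit_bytes=${2:-104857600}
    if rpc_log_too_large "$file" "$limit_bytes"; then
        rm -f -- "$file"
        echo "removed $file"
    fi
)
```

Calling prune_rpc_log with no arguments checks the default log path; a different path or threshold can be passed as the first and second arguments.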

The configuration of SR-IOV depends on the kernel version and requires careful consideration to ensure optimal device visibility and system stability.

For deployments with multiple virtual devices, autoprobe must be disabled to ensure reliable device discovery by setting sriov_drivers_autoprobe=0 in /sys/bus/pci/devices/<BDF>/ . Failing to do this may cause the following issues:

Incomplete device visibility

Missing virtual disks

Potential system hangs during device initialization

Unreliable behavior with large numbers of VFs (>100)

Configurations required for large-scale deployments:

Disable autoprobe:

    echo 0 > /sys/bus/pci/devices/<BDF>/sriov_drivers_autoprobe

Manually bind the VFs as required using your specific driver binding tools (e.g., driverctl or bind/unbind in sysfs).
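The autoprobe step can be wrapped in a small function that also confirms the value was written. This is a sketch only: the function name is hypothetical, and the optional second argument (a sysfs root other than /sys) exists purely so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: disable VF driver autoprobe for one PF and verify it.
# Arguments: BDF [SYSFS_ROOT]; SYSFS_ROOT defaults to /sys.
disable_vf_autoprobe() (
    bdf=$1
    sysfs_root=${2:-/sys}
    knob="$sysfs_root/bus/pci/devices/$bdf/sriov_drivers_autoprobe"
    [ -e "$knob" ] || { echo "no such entry: $knob" >&2; exit 1; }
    echo 0 > "$knob"
    # Read the value back to confirm that autoprobe is now off.
    [ "$(cat "$knob")" = "0" ]
)
```

For example, disable_vf_autoprobe 0000:84:00.0 (an example BDF) would be run as root before the VFs are created; the VFs are then bound manually as described above.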

For small-scale deployments (with fewer than 100 VFs), you can use the sriov_numvfs sysfs entry to set the number of VFs (the sriov_totalvfs entry is read-only and reports the maximum number of VFs the PF supports):

    echo <number_of_vfs> > /sys/bus/pci/devices/<BDF>/sriov_numvfs

Note Applying this configuration to larger deployments without first disabling autoprobe may lead to the issues listed above.

Note After completing the SR-IOV configuration, no disk will be exposed in the hypervisor by default. The disk will only appear within the VM after the associated PCIe VF is assigned to the VM using the virtualization manager (e.g., libvirt, VMware, etc.). If you need to use the device directly from the hypervisor, manually bind the PCIe VF to the desired driver.

Note Hot-plug PFs do not support SR-IOV.

Info It is recommended to add pci=assign-busses to the boot command line when creating more than 127 VFs.

Note Without this option, the following errors may appear on the host, and the virtio driver will not probe these devices:

    pci 0000:84:00.0: [1af4:1041] type 7f class 0xffffff
    pci 0000:84:00.0: unknown header type 7f, ignoring device
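Whether the running host kernel was booted with this option can be checked from its command line. The sketch below assumes a Linux host exposing /proc/cmdline; the function name is hypothetical, and the optional file argument exists only for testing.

```shell
# Hypothetical helper: succeed when PARAM appears as a whole word in the
# kernel command line (read from /proc/cmdline unless a file is given).
has_boot_param() (
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    set -f    # no glob expansion while splitting the command line into words
    for word in $(cat "$cmdline_file"); do
        [ "$word" = "$param" ] && exit 0
    done
    exit 1
)
```

For example, has_boot_param pci=assign-busses || echo "pci=assign-busses is not set" prints a reminder when the option is missing.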





Note Zero-copy is supported on SPDK 21.07 and higher.

SNAP-direct allows SNAP applications to transfer data directly from the host memory to remote storage without using any staging buffer inside the DPU.

SNAP enables the feature according to the SPDK BDEV configuration only when working against an SPDK NVMe-oF RDMA block device.

Zero copy is enabled by default. To set it explicitly, use the following environment variable:

    SNAP_RDMA_ZCOPY_ENABLE=1

For more info refer to the section SNAP Environment Variables.

NVMe/TCP Zero Copy is implemented as a custom NVDA_TCP transport in SPDK NVMe initiator, and it is based on a new XLIO socket layer implementation.

The implementation is different for Tx and Rx:

The NVMe/TCP Tx Zero Copy is similar between RDMA and TCP in that the data is sent from the host memory directly to the wire without an intermediate copy to Arm memory.

The NVMe/TCP Rx Zero Copy allows achieving partial zero copy on the Rx flow by eliminating copy from socket buffers (XLIO) to application buffers (SNAP). But data still must be DMA'ed from Arm to host memory.

To enable NVMe/TCP Zero Copy, use SPDK v22.05.nvda or higher, built with the --with-xlio option.

Note For more information about XLIO including limitations and bug fixes, refer to the NVIDIA Accelerated IO (XLIO) Documentation.

To enable SNAP TCP XLIO Zero Copy:

SNAP container: Set the environment variables and resources in the YAML file:

    resources:
      requests:
        memory: "4Gi"
        cpu: "8"
      limits:
        hugepages-2Mi: "4Gi"
        memory: "6Gi"
        cpu: "16"
    ## Set according to the local setup
    env:
      - name: APP_ARGS
        value: "--wait-for-rpc"
      - name: SPDK_XLIO_PATH
        value: "/usr/lib/libxlio.so"

SNAP sources: Set the environment variables and resources in the relevant scripts:

In run_snap.sh, edit the APP_ARGS variable to use the SPDK command line argument --wait-for-rpc:

    APP_ARGS="--wait-for-rpc"

In set_environment_variables.sh, uncomment the SPDK_XLIO_PATH environment variable:

    export SPDK_XLIO_PATH="/usr/lib/libxlio.so"

Note NVMe/TCP XLIO requires a BlueField Arm OS hugepage size of 4Gi. For information on configuring the hugepages, refer to sections "Step 1: Allocate Hugepages" and "Adjusting YAML Configuration". At high scale, it is required to use the global variable XLIO_RX_BUFS=4096 even though it leads to high memory consumption. Using XLIO_RX_BUFS=1024 requires lower memory consumption but limits the ability to scale the workload.

Info For more info refer to the section "SNAP Environment Variables".

Tip It is recommended to configure NVMe/TCP XLIO with the transport ack timeout option increased to 12.

    [dpu] spdk_rpc.py bdev_nvme_set_options --transport-ack-timeout 12

Other bdev_nvme options may be adjusted according to requirements.

Expose an NVMe-oF subsystem with one namespace by using a TCP transport type on the remote SPDK target.

    [dpu] spdk_rpc.py sock_set_default_impl -i xlio
    [dpu] spdk_rpc.py framework_start_init
    [dpu] spdk_rpc.py bdev_nvme_set_options --transport-ack-timeout 12
    [dpu] spdk_rpc.py bdev_nvme_attach_controller -b nvme0 -t nvda_tcp -a 3.3.3.3 -f ipv4 -s 4420 -n nqn.2023-01.io.nvmet
    [dpu] snap_rpc.py nvme_subsystem_create --nqn nqn.2023-01.com.nvda:nvme:0
    [dpu] snap_rpc.py nvme_namespace_create -b nvme0n1 -n 1 --nqn nqn.2023-01.com.nvda:nvme:0 --uuid 16dab065-ddc9-8a7a-108e-9a489254a839
    [dpu] snap_rpc.py nvme_controller_create --nqn nqn.2023-01.com.nvda:nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --suspended --num_queues 16
    [dpu] snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl1 -n 1
    [dpu] snap_rpc.py nvme_controller_resume -c NVMeCtrl1 -n 1

    [host] modprobe -v nvme
    [host] fio --filename /dev/nvme0n1 --rw randrw --name=test-randrw --ioengine=libaio --iodepth=64 --bs=4k --direct=1 --numjobs=1 --runtime=63 --time_based --group_reporting --verify=md5

Info For more information on XLIO, please refer to XLIO documentation.





The SPDK version that comes with SNAP supports hardware encryption/decryption offload. To enable AES/XTS, follow the instructions under section "Modifying SF Trust Level to Enable Encryption".

SNAP offers support for zero copy with encryption for bdev_nvme with an RDMA transport.

Note If another bdev_nvme transport or base bdev other than NVMe is used, then zero copy flow is not supported, and additional DMA operations from the host to the BlueField Arm are performed.

Info Refer to section "SPDK Crypto Example" to see how to configure zero copy flow with AES_XTS offload.

| Command | Description |
| --- | --- |
| mlx5_scan_accel_module | Accepts a list of devices to be used for the crypto operation |
| accel_crypto_key_create | Creates a crypto key |
| bdev_nvme_attach_controller | Constructs NVMe block device |
| bdev_crypto_create | Creates a virtual block device which encrypts write IO commands and decrypts read IO commands |

mlx5_scan_accel_module accepts a list of devices to use for the crypto operation, provided in the --allowed-devs parameter. If no devices are specified, the first device that supports encryption is used.

For best performance, it is recommended to use the devices with the largest InfiniBand MTU (4096). The MTU size can be verified using the ibv_devinfo command (look for the max and active MTU fields). Normally, the mlx5_2 device is expected to have an MTU of 4096 and should be used as an allowed crypto device.
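The MTU check can be automated by filtering the ibv_devinfo output. The sketch below assumes the usual ibv_devinfo layout, in which each device starts with an "hca_id:" line and each port reports an "active_mtu:" line; the function name is hypothetical.

```shell
# Hypothetical helper: read `ibv_devinfo` output on stdin and print the name
# of every device that has at least one port with the wanted active MTU.
devices_with_active_mtu() {
    awk -v want="${1:-4096}" '
        $1 == "hca_id:" { dev = $2 }
        $1 == "active_mtu:" && $2 == want {
            if (!(dev in seen)) print dev
            seen[dev] = 1
        }
    '
}
```

A typical call is ibv_devinfo | devices_with_active_mtu 4096; the printed names are candidates for the --allowed-devs parameter.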

Command parameters:

| Parameter | Mandatory? | Type | Description |
| --- | --- | --- | --- |
| qp-size | No | Number | QP size |
| num-requests | No | Number | Size of the shared requests pool |
| allowed-devs | No | String | Comma-separated list of allowed device names (e.g., "mlx5_2"). Note: Make sure that the device used for RDMA traffic is selected to support zero copy. |
| enable-driver | No | Boolean | Enables accel_mlx5 platform driver. Allows AES_XTS RDMA zero copy. |

accel_crypto_key_create creates a crypto key. One key can be shared by multiple bdevs.

Command parameters:

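| Parameter | Mandatory? | Type | Description |
| --- | --- | --- | --- |
| cipher | Yes | Number | Crypto protocol (AES_XTS) |
| key | Yes | Number | Key |
| key2 | Yes | Number | Key2 |
| name | Yes | String | Key name |

bdev_nvme_attach_controller creates an NVMe block device.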
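For test setups, a key pair can be generated instead of typed by hand. The sketch below is an assumption-laden illustration: it produces two distinct 16-byte values as 32 hex characters each (the length used in this document's example), reads them from /dev/urandom, and uses hypothetical helper names. It is not a key management recommendation.

```shell
# Hypothetical helper: print NBYTES random bytes as lowercase hex.
random_hex() {
    od -An -v -tx1 -N "$1" /dev/urandom | tr -d ' \n'
}

# Hypothetical helper: print "KEY KEY2" with two distinct 16-byte values,
# since XTS is normally used with two different keys.
gen_xts_key_pair() (
    key=$(random_hex 16)
    key2=$(random_hex 16)
    while [ "$key2" = "$key" ]; do
        key2=$(random_hex 16)
    done
    echo "$key $key2"
)
```

The two values can then be passed to accel_crypto_key_create with the -k and -e options, as in the example later in this section.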

Command parameters:

| Parameter | Mandatory? | Type | Description |
| --- | --- | --- | --- |
| name | Yes | String | Name of the NVMe controller, prefix for each bdev name |
| trtype | Yes | String | NVMe-oF target trtype (e.g., rdma, pcie) |
| traddr | Yes | String | NVMe-oF target address (e.g., an IP address or BDF) |
| trsvcid | No | String | NVMe-oF target trsvcid (e.g., a port number) |
| adrfam | No | String | NVMe-oF target adrfam (e.g., ipv4, ipv6) |
| nqn | No | String | NVMe-oF target subnqn |

bdev_crypto_create creates a virtual crypto block device which adds encryption to the base block device.

Command parameters:

| Parameter | Mandatory? | Type | Description |
| --- | --- | --- | --- |
| base_bdev_name | Yes | String | Name of the base bdev |
| name | Yes | String | Crypto bdev name |
| key-name | Yes | String | Name of the crypto key created with accel_crypto_key_create |

The following is an example of a configuration with a crypto virtual block device created on top of bdev_nvme with RDMA transport and zero copy support:

    [dpu] # spdk_rpc.py mlx5_scan_accel_module --allowed-devs "mlx5_2" --enable-driver
    [dpu] # spdk_rpc.py framework_start_init
    [dpu] # spdk_rpc.py accel_crypto_key_create -c AES_XTS -k 00112233445566778899001122334455 -e 11223344556677889900112233445500 -n test_dek
    [dpu] # spdk_rpc.py bdev_nvme_attach_controller -b nvme0 -t rdma -a 1.1.1.1 -f ipv4 -s 4420 -n nqn.2016-06.io.spdk:cnode0
    [dpu] # spdk_rpc.py bdev_crypto_create nvme0n1 crypto_0 -n test_dek
    [dpu] # snap_rpc.py spdk_bdev_create crypto_0
    [dpu] # snap_rpc.py nvme_subsystem_create --nqn nqn.2023-05.io.nvda.nvme:0
    [dpu] # snap_rpc.py nvme_controller_create --nqn nqn.2023-05.io.nvda.nvme:0 --pf_id 0 --ctrl NVMeCtrl0 --suspended
    [dpu] # snap_rpc.py nvme_namespace_create --nqn nqn.2023-05.io.nvda.nvme:0 --bdev_name crypto_0 --nsid 1 --uuid 263826ad-19a3-4feb-bc25-4bc81ee7749e
    [dpu] # snap_rpc.py nvme_controller_attach_ns --ctrl NVMeCtrl0 --nsid 1
    [dpu] # snap_rpc.py nvme_controller_resume --ctrl NVMeCtrl0

Virtio transitional devices are devices that support both drivers conforming to the modern specification and legacy drivers (conforming to the legacy 0.95 specification). Currently, SNAP supports transitional devices over hotplug PFs only (static PFs and VFs are not supported).

To enable support for virtio-blk transitional devices, a few special configurations must be applied:

The firmware must be configured with parameter VIRTIO_EMULATION_HOTPLUG_TRANS:

    mlxconfig -d /dev/mst/mt41692_pciconf0 s VIRTIO_EMULATION_HOTPLUG_TRANS=1

The host OS must be configured with special parameters:

If the kernel version is older than 5.1, IOMMU must be disabled (i.e., set Linux boot parameter intel_iommu=off).

If the virtio-pci kernel module is built in, the kernel boot parameter virtio_pci.force_legacy must be set to 1 (i.e., virtio_pci.force_legacy=1).

If the virtio-pci kernel module is loadable (not built in), load the module with the force_legacy module parameter (i.e., modprobe virtio_pci force_legacy=1).

A power cycle is required for the changes to take effect.
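The kernel version condition above can be checked on the host with a short comparison. This is a sketch under the assumption that the release string has the usual MAJOR.MINOR prefix (as printed by uname -r); the function name is hypothetical.

```shell
# Hypothetical helper: succeed when kernel release string REL (for example
# the output of `uname -r`) is older than MAJOR.MINOR.
kernel_older_than() (
    rel=$1 want_major=$2 want_minor=$3
    major=${rel%%.*}              # text before the first dot
    rest=${rel#*.}                # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -lt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -lt "$want_minor" ]; }
)
```

For example, kernel_older_than "$(uname -r)" 5 1 && echo "disable IOMMU (intel_iommu=off)" prints a reminder only on kernels older than 5.1.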

After configuration is finished, the user may use a dedicated option to hot plug a virtio-blk function in transitional device mode:

    snap_rpc.py virtio_blk_emulation_device_attach --transitional_device …

Live migration is a standard process supported by QEMU which allows system administrators to pass devices between virtual machines in a live running system. For more information, refer to QEMU VFIO Device Migration documentation.

Live migration is supported for SNAP virtio-blk devices. It can be activated using a driver with proper support (e.g., NVIDIA's proprietary vDPA-based Live Migration Solution).

    snap_rpc.py virtio_blk_controller_create --dbg_admin_q …





Live upgrade enables updating the SNAP image used by a container without causing SNAP container downtime.

While newer SNAP releases may introduce additional content, potentially causing behavioral differences during the upgrade, the process is designed to ensure backward compatibility. Updates between releases within the same sub-version (e.g., 4.0.0-x to 4.0.0-y) should proceed without issues.

However, updates across different major or minor versions may require changes to system components (e.g., firmware, BFB), which may impact backward compatibility and necessitate a full reboot post update. In those cases, live updates are unnecessary.

To enable live upgrade, perform the following modifications:

1. Allocate double hugepages for the destination and source containers.

2. Make sure the requested amount of CPU cores is available. The default YAML configuration sets the container to request a CPU core range of 8-16. This means that the container is not deployed if there are fewer than 8 available cores, and if there are 16 free cores, the container utilizes all 16. For instance, if a container is currently using all 16 cores and an additional SNAP container is deployed during a live upgrade, each container uses 8 cores during the upgrade process. Once the source container is terminated, the destination container starts utilizing all 16 cores.

   Note For 8-core DPUs, the .yaml must be edited to the range of 4-8 CPU cores.

3. Change the name of the doca_snap.yaml file that describes the destination container (e.g., doca_snap_new.yaml) so as to not overwrite the running container .yaml.

4. Change the name of the new .yaml pod and container on lines 16 and 20, respectively (e.g., snap-new).

5. Deploy the destination container by copying the new yaml (e.g., doca_snap_new.yaml) to kubelet.
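The file copy and renaming steps can be sketched with sed. The helper below is hypothetical and rests on two assumptions that must be checked against the actual file: that the pod and container names sit on the given lines (16 and 20 by default, as stated above) and that each of those lines uses a "name:" key. The new name must not contain "/" or "&", which are special to sed.

```shell
# Hypothetical helper: copy SRC yaml to DST and replace the value of the
# `name:` key on two given lines with NEW_NAME.
# Arguments: SRC DST NEW_NAME [POD_LINE] [CONTAINER_LINE]
rename_snap_yaml() (
    src=$1 dst=$2 new_name=$3 pod_line=${4:-16} ctr_line=${5:-20}
    sed -e "${pod_line}s/\(name:[[:space:]]*\).*/\1${new_name}/" \
        -e "${ctr_line}s/\(name:[[:space:]]*\).*/\1${new_name}/" \
        "$src" > "$dst"
)
```

For example, rename_snap_yaml doca_snap.yaml doca_snap_new.yaml snap-new writes the renamed copy and leaves the original file untouched; review the result before deploying it.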

Note After deploying the destination container, until the live update process is complete, avoid making any configuration changes via RPC. Specifically, do not create or destroy hotplug functions.

When restoring a controller in the destination container during a live update, it is recommended to use the same arguments originally used for controller creation in the source container.

Users may need to update the RPC alias, since the container name has changed.






The SNAP image is live upgraded by moving the SNAP controllers and SPDK block devices between different containers while minimizing the duration of the impact on host VMs.

Source container – the running container before live upgrade

Destination container – the running container after live upgrade

1. Follow the steps in section "Live Upgrade Prerequisites" and deploy the destination SNAP container using the modified yaml file.

2. Query the source and destination containers:

    crictl ps -r

3. Check the logs of the destination container to confirm that SNAP started successfully, then copy the live update script from the container to your environment:

    [dpu] crictl logs -f <dest-container-id>
    [dpu] crictl exec <dest-container-id> cp /opt/nvidia/nvda_snap/bin/live_update.py /etc/nvda_snap/

4. Run the live_update.py script to move all active objects from the source container to the destination container:

    [dpu] cd /etc/nvda_snap
    [dpu] ./live_update.py -s <source-container-id> -d <dest-container-id>

5. Delete the source container.

Note To post RPCs, use the crictl tool:

    crictl exec -it <container-id X> snap_rpc.py <RPC-method>
    crictl exec -it <container-id Y> spdk_rpc.py <RPC-method>

Note To automate the SNAP configuration (e.g., following failure or reboot) as explained in section "Automate SNAP Configuration (Optional)", spdk_rpc_init.conf and snap_rpc_init.conf must not include any configs as part of the live upgrade. Then, once the transition to the new container is done, spdk_rpc_init.conf and snap_rpc_init.conf can be modified with the desired configuration.

The live update tool is designed to support fast live updates. It iterates over the available emulation functions and performs the following actions for each one.

Info Note that the physical function controller must remain available in the source container while its virtual functions are being live-updated. Only after the virtual function controllers in the source container are destroyed can the physical function controller be removed from that container.

1. On the source container:

    snap_rpc.py virtio_blk_controller_suspend --ctrl [ctrl_name] --events_only

2. On the destination container:

    spdk_rpc.py bdev_nvme_attach_controller ...
    snap_rpc.py virtio_blk_controller_create ... --suspended --live_update_listener

3. On the source container:

    snap_rpc.py virtio_blk_controller_destroy --ctrl [ctrl_name]
    spdk_rpc.py bdev_nvme_detach_controller [bdev_name]

Message Signaled Interrupts eXtended (MSIX) is an interrupt mechanism that allows devices to use multiple interrupt vectors, providing more efficient interrupt handling than traditional interrupt mechanisms such as shared interrupts. In Linux, MSIX is supported in the kernel and is commonly used for high-performance devices such as network adapters, storage controllers, and graphics cards. MSIX provides benefits such as reduced CPU utilization, improved device performance, and better scalability, making it a popular choice for modern hardware.

However, proper configuration and management of MSIX interrupts can be challenging and requires careful tuning to achieve optimal performance, especially in a multi-function environment such as SR-IOV.

By default, BlueField distributes MSIX vectors evenly between all virtual PCIe functions (VFs). This approach is not optimal, as users may choose to attach VFs to different VMs, each with a different number of resources. Dynamic MSIX management allows the user to manually control the number of MSIX vectors provided to each VF independently.

Note Configuration and behavior are similar for all emulation types, and specifically NVMe and virtio-blk.

Dynamic MSIX management is built from several configuration steps:

At this point, and at any other time in the future when no VF controllers are opened (sriov_numvfs=0), all PF-related MSIX vectors can be reclaimed from the VFs to the PF's free MSIX pool.

The user must take some of the MSIX from the free pool and give them to a certain VF during VF controller creation.

When destroying a VF controller, the user may choose to release its MSIX back to the pool.

Once configured, the MSIX link to the VFs remains persistent and may change only in the following scenarios:

User explicitly requests to return VF MSIXs back to the pool during controller destruction.

PF explicitly reclaims all VF MSIXs back to the pool.

Arm reboot (FE reset/cold boot) has occurred.

To emphasize, the following scenarios do not change MSIX configuration:

Application restart/crash.

Closing and reopening PF/VFs without dynamic MSIX support.

The following is an NVMe example of dynamic MSIX configuration steps (similar configuration also applies for virtio-blk):

1. Reclaim all MSIX from VFs to PF's free MSIX pool:

    snap_rpc.py nvme_controller_vfs_msix_reclaim <CtrlName>

2. Query the controller list to get information about the resource constraints for the PF:

    # snap_rpc.py nvme_controller_list -c <CtrlName>
    …
    'free_msix': 100,
    …
    'free_queues': 200,
    …
    'vf_min_msix': 2,
    …
    'vf_max_msix': 64,
    …
    'vf_min_queues': 0,
    …
    'vf_max_queues': 31,
    …

Where:

free_msix stands for the total number of MSIX available in the PF's free pool, to be assigned to VFs through the parameter vf_num_msix (of the <protocol>_controller_create RPC).

free_queues stands for the total number of queues (or "doorbells") available in the PF's free pool, to be assigned to VFs through the parameter num_queues (of the <protocol>_controller_create RPC).

vf_min_msix and vf_max_msix together define the configurable range of the vf_num_msix parameter value which can be passed in the <protocol>_controller_create RPC for each VF.

vf_min_queues and vf_max_queues together define the configurable range of the num_queues parameter value which can be passed in the <protocol>_controller_create RPC for each VF.

3. Distribute MSIX between VFs during their creation process, considering the PF's limitations:

    snap_rpc.py nvme_controller_create --vf_num_msix <n> --num_queues <m> …

Note It is strongly advised to provide both vf_num_msix and num_queues parameters upon VF controller creation. Providing only one of the values may result in a conflict between MSIX and queue configuration, which may in turn cause the controller/driver to malfunction.

Tip In NVMe protocol, MSIX is used by NVMe CQ. Therefore, it is advised to assign 1 MSIX out of the PF's global pool (free_msix) for each assigned queue. In virtio protocol, MSIX is used by virtqueue and one extra MSIX is required for BAR configuration changes notification. Therefore, it is advised to assign 1 MSIX out of the PF's global pool (free_msix) for every assigned queue, and one more as configuration MSIX.

In summary, the best practice for queues/MSIX ratio configuration is:

For NVMe: num_queues = vf_num_msix

For virtio: num_queues = vf_num_msix - 1

4. Upon VF teardown, release MSIX back to the free pool:

    snap_rpc.py nvme_controller_destroy --release_msix …

5. Set SR-IOV on the host driver:

    echo <N> > /sys/bus/pci/devices/<BDF>/sriov_numvfs

Note It is highly advised to open all VF controllers in SNAP in advance, before binding VFs to the host/guest driver. That way, if a configuration mistake does not leave enough MSIX for all VFs, the configuration remains reversible as MSIX is still modifiable. Otherwise, the driver may try to use the already-configured VFs before all VF configuration has finished but will not be able to use all of them (due to lack of MSIX). The latter scenario may result in host deadlock which, at worst, can be recovered only with cold boot.

Note There are several ways to configure dynamic MSIX safely (without VF binding):

Disable automatic binding of VFs to the kernel driver:

    # echo 0 > /sys/bus/pci/devices/<BDF>/sriov_drivers_autoprobe

After finishing MSIX configuration for all VFs, they can then be bound to VMs, or even back to the hypervisor:

    echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/bind

Use VFIO driver (instead of kernel driver) for SR-IOV configuration. For example:

    # echo 0000:af:00.2 > /sys/bus/pci/drivers/vfio-pci/bind # Bind PF to VFIO driver
    # echo 1 > /sys/module/vfio_pci/parameters/enable_sriov
    # echo <N> > /sys/bus/pci/drivers/vfio-pci/0000:af:00.2/sriov_numvfs # Create VFs device for it
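The queue/MSIX guidance above reduces to simple arithmetic that can be checked before any controller is created. The sketch below uses hypothetical helper names; the inputs are the values reported by the controller list query (free_msix, vf_min_msix, vf_max_msix) and the planned number of VFs.

```shell
# Hypothetical helper: print the vf_num_msix value suggested for PROTOCOL
# (nvme or virtio) and NUM_QUEUES.
suggested_vf_num_msix() (
    protocol=$1 num_queues=$2
    case $protocol in
        nvme)   echo "$num_queues" ;;
        virtio) echo $((num_queues + 1)) ;;   # one extra for config changes
        *)      echo "unknown protocol: $protocol" >&2; exit 1 ;;
    esac
)

# Hypothetical helper: succeed when NUM_VFS controllers with VF_NUM_MSIX
# vectors each fit in FREE_MSIX and VF_NUM_MSIX lies within
# [VF_MIN_MSIX, VF_MAX_MSIX].
msix_plan_fits() (
    num_vfs=$1 vf_num_msix=$2 free_msix=$3 vf_min_msix=$4 vf_max_msix=$5
    [ "$vf_num_msix" -ge "$vf_min_msix" ] &&
    [ "$vf_num_msix" -le "$vf_max_msix" ] &&
    [ $((num_vfs * vf_num_msix)) -le "$free_msix" ]
)
```

With the example values shown above (free_msix 100, range 2 to 64), six NVMe VFs with 16 queues each need 96 vectors and fit, while seven such VFs would need 112 and do not.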

NVMe recovery allows the NVMe controller to be recovered after a SNAP application is closed whether gracefully or after a crash (e.g., kill -9 ).

To use NVMe recovery, the controller must be re-created in a suspended state with the same configuration as before the crash (i.e., the same bdevs, num queues, and namespaces with the same uuid, etc).

Note The controller must be resumed only after all NSs are attached.

NVMe recovery uses files on the BlueField under /dev/shm to recover the internal state of the controller. Shared memory files are deleted when the BlueField is reset. For this reason, recovery is not supported after BlueField reset.

The following options are available to enable virtio-blk crash recovery.

For virtio-blk crash recovery with --force_in_order , disable the VBLK_RECOVERY_SHM environment variable and create a controller with the --force_in_order argument.

In virtio-blk SNAP, the application is not guaranteed to recover correctly after a sudden crash (e.g., kill -9 ).

To enable the virtio-blk crash recovery, set the following:

    snap_rpc.py virtio_blk_controller_create --force_in_order …

Note Setting force_in_order to 1 may impact virtio-blk performance, as commands are then served in order.

Note If --force_in_order is not used, any failure or unexpected teardown in SNAP or the driver may result in anomalous behavior because of limited support in the Linux kernel virtio-blk driver.





For virtio-blk crash recovery without --force_in_order , enable the VBLK_RECOVERY_SHM environment variable and create a controller without the --force_in_order argument.

Virtio-blk recovery allows the virtio-blk controller to be recovered after a SNAP application is closed whether gracefully or after a crash (e.g., kill -9 ).

To use virtio-blk recovery without the --force_in_order flag, VBLK_RECOVERY_SHM must be enabled and the controller must be recreated with the same configuration as before the crash (i.e., same bdevs, num queues, etc.).

When VBLK_RECOVERY_SHM is enabled, virtio-blk recovery uses files on the BlueField under /dev/shm to recover the internal state of the controller. Shared memory files are deleted when the BlueField is reset. For this reason, recovery is not supported after BlueField reset.

SNAP can save its configuration as defined by user RPCs, and upon SNAP restart it can use the config json file to recover the configuration that was set before SNAP was closed.

This is used to recover SNAP state and cannot be used to initially configure SNAP.

Note Currently, this is supported for virtio-blk.

To use SNAP configuration recovery, set the SNAP_RPC_INIT_CONF_JSON environment variable to a directory path. The snap_config.json file in this path should be deleted if a new configuration (different from the one in place before SNAP was closed) is set after SNAP is restarted.

Running this way ensures that your configuration is saved only after all changes in the script have been successfully performed.

When SNAP is restarted, there is no need to re-run SNAP RPCs or set RPCs in init files after the initial configuration. SNAP loads the configuration saved under the SNAP_RPC_INIT_CONF_JSON path. This method is better for fast recovery.

Warning When changing controller/function configuration, the driver must remain unloaded until the configuration change is done. Otherwise, configuration recovery may fail if the configuration change has not finished successfully.

Hotplugged emulation functions are persistent between SNAP runs (but not across BlueField resets) and should be set once during SNAP's initial configuration. Only the controllers created on these functions are saved in the config state.

If the use case supports SNAP crash recovery after a reboot, the recommended file path inside the container is /etc/nvda_snap. If the use case does not support recovery after a reboot, store the file in a path such as /tmp/ or /dev/shm.
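When a new configuration is about to be set, the stale snap_config.json has to be removed first. The following sketch shows one way to do that; the function name is hypothetical, and it assumes SNAP_RPC_INIT_CONF_JSON holds the directory path as described above.

```shell
# Hypothetical helper: remove a stale snap_config.json from the directory
# that SNAP_RPC_INIT_CONF_JSON points to (or from the directory passed as
# the first argument), so that a new configuration can be set.
reset_saved_snap_config() (
    conf_dir=${1:-${SNAP_RPC_INIT_CONF_JSON:-}}
    [ -n "$conf_dir" ] || { echo "no directory given" >&2; exit 1; }
    rm -f -- "$conf_dir/snap_config.json"
)
```

For example, reset_saved_snap_config /etc/nvda_snap clears the saved state in the recommended persistent location.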





The following table outlines features designed to accelerate SNAP initialization and recovery processes following termination.

Feature: SPDK JSON-RPC configuration file

Description: An initial configuration can be specified for the SPDK configuration in SNAP. The configuration file is a JSON file containing all the SPDK JSON-RPC method invocations necessary for the desired configuration. Moving from posting RPCs to a JSON file improves bring-up time.

Info For more information, check the SPDK JSON-RPC documentation.

How to: To generate a JSON-RPC file based on the current configuration, run:

    spdk_rpc.py save_config > config.json

The config.json file can then be passed to a new SNAP deployment using the SPDK_RPC_INIT_CONF_JSON environment variable in the YAML.

Note If SPDK encounters an error while processing the JSON configuration file, the initialization phase fails, causing SNAP to exit with an error code.

Feature: Disable SPDK accel functionality

Description: The SPDK accel functionality is necessary when using NVMe TCP features. If NVMe TCP is not used, accel should be manually disabled to reduce the SPDK startup time, which can otherwise take a few seconds.

How to: To disable all accel functionality, edit the flags disable_signature, disable_crypto, and enable_module in the config file as follows:

    {
      "method": "mlx5_scan_accel_module",
      "params": {
        "qp_size": 64,
        "cq_size": 1024,
        "num_requests": 2048,
        "enable_driver": false,
        "split_mb_blocks": 0,
        "siglast": false,
        "qp_per_domain": false,
        "disable_signature": true,
        "disable_crypto": true,
        "enable_module": false
      }
    }

Feature: Provide the emulation manager name

Description: If the SNAP_EMULATION_MANAGER environment variable is not defined (default), SNAP searches through all available devices to find the emulation manager, which may slow down the initialization process. Explicitly defining the device reduces the chance of initialization delays.

How to: Set the SNAP_EMULATION_MANAGER variable in the YAML. Refer to the "SNAP Environment Variables" page for more information.

Feature: DPU mode for virtio-blk

Description: DPU mode is supported only with virtio-blk. DPU mode reduces SNAP downtime during crash recovery.

How to: Set VIRTIO_EMU_PROVIDER=dpu in the YAML. Refer to the "SNAP Environment Variables" page for more information.

Feature: SNAP configuration recovery for virtio-blk

Description: SNAP configuration recovery enables restoring the SNAP state without the need to re-post SNAP RPCs. By moving from posting individual RPCs to using a pre-saved JSON configuration file, the bring-up time is significantly improved.

How to: Set SNAP_RPC_INIT_CONF_JSON to a path where the config file should be saved. Also use an SPDK JSON-RPC configuration file.

Feature: Hugepages allocation

Description: SNAP allocates a mempool from hugepages. Reducing its size can impact the duration of SNAP's crash recovery.

How to: SNAP_MEMPOOL_SIZE_MB is set to 1024 MB by default.
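After editing the SPDK configuration file, the three accel flags can be checked with a quick text search. This is a rough sketch with a hypothetical function name: it only looks for the flag values anywhere in the file and does not parse the JSON or confirm that they belong to the mlx5_scan_accel_module entry.

```shell
# Hypothetical helper: succeed when FILE contains the three flag settings
# that disable accel functionality (text match only, no JSON parsing).
accel_disabled_in_config() (
    file=$1
    sp='[[:space:]]*'
    grep -Eq "\"disable_signature\"${sp}:${sp}true" "$file" &&
    grep -Eq "\"disable_crypto\"${sp}:${sp}true" "$file" &&
    grep -Eq "\"enable_module\"${sp}:${sp}false" "$file"
)
```

For example, accel_disabled_in_config config.json || echo "accel is still enabled" flags a file in which at least one of the settings is missing.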

The IO/Core Multiplexer (MP) feature is a mechanism that allows users to control how I/O processing from a single source is distributed across different DPU cores. The feature offers two main policy options:

None (Default): In this policy, all I/Os from a single source are handled by a single DPU core. Different I/O sources are evenly distributed among the available DPU cores. This approach is optimized for I/O latency and is recommended for users prioritizing low-latency operations.

(Weighted) Round Robin: This policy distributes I/Os from a single source among multiple DPU cores in a round-robin fashion. For backends supporting per-core weight updates (e.g., SPDK nvmf bdev), I/Os are distributed according to given weights. Otherwise, they are evenly distributed among all DPU cores. This approach is designed to optimize I/O bandwidth and is recommended for users seeking to maximize throughput or those dealing with low per-core backend performance (e.g., TCP-based backends).

To configure the IO/Core Multiplexer policy, users need to set the IO_CORE_DISTRIBUTION_POLICY environment variable. The available options are:

"none": Refers to the default policy where all I/Os from a single source are handled by a single DPU core.

"weighted_rr": Refers to the (Weighted) Round Robin policy, distributing I/Os across multiple cores.

Note The weighted_rr policy is not supported for virtio-blk.
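A deployment script can validate the chosen policy before SNAP is started. The sketch below uses a hypothetical function name and a protocol label ("virtio-blk") chosen for illustration; the accepted policy values are the two listed above.

```shell
# Hypothetical helper: validate an IO_CORE_DISTRIBUTION_POLICY value for a
# given protocol label. An empty policy is treated as the default ("none").
check_io_core_policy() (
    policy=${1:-none}
    protocol=${2:-}
    case $policy in
        none)
            exit 0 ;;
        weighted_rr)
            if [ "$protocol" = "virtio-blk" ]; then
                echo "weighted_rr is not supported for virtio-blk" >&2
                exit 1
            fi
            exit 0 ;;
        *)
            echo "unknown policy: $policy" >&2
            exit 1 ;;
    esac
)
```

For example, check_io_core_policy "$IO_CORE_DISTRIBUTION_POLICY" nvme returns success for "none" and "weighted_rr" and fails for any other value.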





The SNAP ML optimizer is a tool designed to fine-tune SNAP’s poller parameters, enhancing SNAP I/O handling performance and increasing controller throughput based on specific environments and workloads.

During workload execution, the optimizer iteratively adjusts configurations (actions) and evaluates their impact on performance (reward). By predicting the best configuration to test next, it efficiently narrows down to the optimal setup without needing to explore every possible combination.

Once the optimal configuration is identified, it can be applied to the target system, improving performance under similar conditions. Currently, the tool supports "IOPS" as the reward metric, which it aims to maximize.

The machine running the optimizer must meet the following requirements:

SSH access to the BlueField

Python 3.10 or above

At least 6 GB of free storage

To set up the SNAP ML optimizer:

1. Copy the snap_ml folder from the container to the shared nvda_snap folder and then to the requested machine:

    crictl exec -it $(crictl ps -s running -q --name snap) cp -r /opt/nvidia/nvda_snap/bin/snap_ml /etc/nvda_snap/

2. Change directory to the snap_ml folder:

    cd tools/snap_ml

3. Create a virtual environment for the SNAP ML optimizer. This ensures that the required dependencies are installed in an isolated environment:

    python3 -m venv snap_ml

4. Activate the virtual environment to start working within this isolated environment:

    source snap_ml/bin/activate

5. Install the Python package requirements. This may take some time depending on your system's performance:

    pip3 install --no-cache-dir -r requirements.txt

6. Run the SNAP ML optimizer. Use the --help flag to see the available options and usage information:

    python3 snap_ml.py --help

    --version                     Show the version and exit.
    -f, --framework <TEXT>        Name of framework (recommended: ax; supported: ax, pybo).
    -t, --total-trials <INTEGER>  Number of optimization iterations. The recommended range is 25-60.
    --filename <TEXT>             Where to save the results (default: last_opt.json).
    --remote <TEXT>               Connect remotely to the BlueField card, format: <bf_name>:<username>:<password>
    --snap-rpc-path <TEXT>        SNAP RPC prefix (default: container path).
    --log-level <TEXT>            CRITICAL | ERROR | WARN | WARNING | INFO | DEBUG
    --log-dir <TEXT>              Where to save the logs.

The snap_actions_set command is used to dynamically adjust SNAP parameters (known as "actions") that control polling behavior. This command is a core feature of SNAP-AI tools, enabling both automated optimization for specific environments and workloads, as well as manual adjustment of polling parameters.

Command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| poll_size | No | Number | Maximum number of IOs SNAP passes in a single polling cycle (integer; 1-256) |
| poll_ratio | No | Number | The rate at which SNAP poll cycles occur (float; 0 < poll_ratio ≤ 1) |
| max_inflights | No | Number | Maximum number of in-flight IOs per core (integer; 1-65535) |
| max_iog_batch | No | Number | Maximum fairness batch size (integer; 1-4096) |
| max_new_ios | No | Number | Maximum number of new IOs to handle in a single poll cycle (integer; 1-4096) |
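When setting these parameters manually, it can help to check the values against the documented ranges before issuing the RPC. The following is a minimal sketch; `validate_actions` is a hypothetical helper, not part of SNAP, and it only encodes the ranges listed in the table above.

```python
# Integer parameters and their documented inclusive ranges.
INT_LIMITS = {
    "poll_size": (1, 256),
    "max_inflights": (1, 65535),
    "max_iog_batch": (1, 4096),
    "max_new_ios": (1, 4096),
}


def validate_actions(actions):
    """Return a list of problems found in actions; an empty list means valid."""
    problems = []
    for name, value in actions.items():
        if name == "poll_ratio":
            # Float in the half-open interval (0, 1].
            if not (isinstance(value, (int, float)) and 0 < value <= 1):
                problems.append(f"poll_ratio must satisfy 0 < value <= 1, got {value!r}")
        elif name in INT_LIMITS:
            lo, hi = INT_LIMITS[name]
            if not (isinstance(value, int) and lo <= value <= hi):
                problems.append(f"{name} must be an integer in {lo}-{hi}, got {value!r}")
        else:
            problems.append(f"unknown parameter: {name}")
    return problems


good = validate_actions({"poll_size": 30, "poll_ratio": 0.68, "max_inflights": 32768})
bad = validate_actions({"poll_size": 0, "poll_ratio": 1.5, "bogus": 1})
print(good)
print(bad)
```

All parameters are optional, so an empty dictionary is also considered valid by this check.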

The snap_reward_get command retrieves performance counters, specifically completion counters (or "reward"), which are used by the optimizer to monitor and enhance SNAP performance.

No parameters are required for this command.

To optimize SNAP’s parameters for your environment, use the following command:

    python3 snap_ml.py --framework ax --total-trials 40 --filename example.json --remote <bf_hostname>:<username>:<password> --log-dir <log_directory>

Once the optimization process is complete, the tool automatically applies the optimized parameters. These parameters are also saved in the example.json file in the following format:

    {
        "poll_size": 30,
        "poll_ratio": 0.6847347955107689,
        "max_inflights": 32768,
        "max_iog_batch": 512,
        "max_new_ios": 32
    }

Additionally, the tool documents all iterations, including the actions taken and the rewards received, in a timestamped file named example_<timestamp>.json .

Users can apply the optimized parameters on fresh instances of the SNAP service by explicitly calling the snap_actions_set RPC with the optimized parameters as follows:

    snap_rpc.py snap_actions_set --poll_size 30 --poll_ratio 0.6847 --max_inflights 32768 --max_iog_batch 512 --max_new_ios 32

Note Use the optimized parameters only if the target system is expected to behave similarly to the system on which the SNAP ML optimizer was run.
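Reapplying saved parameters can be scripted by turning the JSON results into a snap_actions_set command line. The following is a minimal sketch under one assumption: every key in the results file maps to a `--<key>` flag, as in the command shown above. `build_actions_command` is a hypothetical helper, and the JSON is inlined here instead of being read from example.json.

```python
import json
import shlex


def build_actions_command(params):
    """Build the argv for snap_actions_set from a dict of optimized parameters."""
    argv = ["snap_rpc.py", "snap_actions_set"]
    for name, value in params.items():
        # Assumption: each saved key is passed as --<key> <value>.
        argv += [f"--{name}", str(value)]
    return argv


saved = json.loads(
    '{"poll_size": 30, "poll_ratio": 0.6847347955107689, '
    '"max_inflights": 32768, "max_iog_batch": 512, "max_new_ios": 32}'
)
argv = build_actions_command(saved)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run` on a machine where snap_rpc.py is available.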





Once users are done using the SNAP ML Optimizer, they can deactivate the Python virtual environment by running:

    deactivate

For NVMe/TCP, the timeout for the fabrics connect command must be increased to 1 second. This timeout is passed as --fabrics-timeout in the bdev_nvme_oci_attach_controller RPC, so every bdev_nvme_oci_attach_controller call must include this parameter:

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller … --fabrics-timeout 1000000

The value provided is in microseconds.

Lazy NVMe bdev allows postponing connection establishment (admin and IO NVMe-oF queues) to the target while providing basic information about block device to SNAP. Actual connection establishment is done when the bdev's NVMe controller begins to receive IO. Lazy configuration allows specifying a list of namespaces with their namespace ID, block size, and number of blocks. It should correspond to the values on the real target.

To avoid unexpected premature connections, the SPDK bdev auto-examine functionality should be disabled. This can be done using the bdev_set_options RPC. It must be issued before the framework_start_init RPC.

    [dpu] # spdk_rpc.py bdev_set_options --disable-auto-examine

Lazy NVMe bdev configuration is enabled using the --lazy-conn parameter in the SPDK bdev_nvme_oci_attach_controller RPC. Its value must be a string with a comma-separated list of namespace descriptions. Each namespace is described by the following parameters:

nsid – namespace ID

blocklen – block size in bytes

blockcnt – size of disk in block units

The following example creates two SPDK NVMe block devices: namespace 1 with a block size of 512 bytes and 1024 blocks (512 KiB total size), and namespace 2 with a block size of 512 bytes and 8192 blocks (4 MiB total size):

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b Nvme0 -t nvda_tcp -n nqn.2016-06.io.spdk:cnode0 -f IPv4 -a 192.168.100.3 -s 4420 --lazy-conn "nsid:1 blocklen:512 blockcnt:1024,nsid:2 blocklen:512 blockcnt:8192"

Note In multipath configuration, lazy parameters must be provided only for the first path or path_group . Subsequent paths and path groups are added to the same bdev and use the same values.
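The --lazy-conn string is easy to get wrong by hand when many namespaces are involved. The following is a minimal sketch that builds it from a list of namespaces and computes each namespace's total size; `lazy_conn_arg` and `size_bytes` are hypothetical helper names, and the format follows the example above.

```python
def lazy_conn_arg(namespaces):
    """Build the --lazy-conn value from (nsid, blocklen, blockcnt) tuples."""
    return ",".join(
        f"nsid:{nsid} blocklen:{blocklen} blockcnt:{blockcnt}"
        for nsid, blocklen, blockcnt in namespaces
    )


def size_bytes(blocklen, blockcnt):
    """Total namespace size in bytes."""
    return blocklen * blockcnt


namespaces = [(1, 512, 1024), (2, 512, 8192)]
arg = lazy_conn_arg(namespaces)
sizes = [size_bytes(blocklen, blockcnt) for _, blocklen, blockcnt in namespaces]
print(arg)
print(sizes)  # 512 KiB and 4 MiB, expressed in bytes
```

The values passed in must still match what the real target reports for each namespace.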





The SPDK version that comes with SNAP provides extensions to NVMe bdev to support hardware encryption/decryption offload, including support for non-standard AES/XTS tweak.

To enable non-standard AES/XTS tweak follow the instructions in the following sections:

The SPDK RPCs used in crypto NVMe bdev configuration are the following:

| Command | Description |
|---|---|
| mlx5_scan_accel_module | Accepts a list of devices to be used for the crypto operation |
| accel_crypto_key_create | Creates crypto key |
| bdev_nvme_oci_attach_controller | Creates and optionally encrypts an NVMe bdev |

The mlx5_scan_accel_module RPC accepts a list of devices to be used for the crypto operation, provided in the --allowed-devs parameter. If no devices are specified, the first device that supports crypto is used.

For the best performance it is recommended to use devices with the largest IB MTU (4096). IB MTU size can be checked with the ibv_devinfo command (search for max and active MTU fields). Normally, the mlx5_2 device is expected to have IB MTU of 4096 and should be used as an allowed crypto device.

When non-standard AES/XTS tweak is used, encryption/decryption requests must be split into 4 KiB chunks (8 blocks of 512 B). This can be set with --split-mb-blocks 8 .

Command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| qp-size | No | Number | QP size |
| num-requests | No | Number | Size of the shared requests pool |
| split-mb-blocks | No | Number | Number of data blocks to be processed as one chunk in the hardware |
| allowed-devs | No | String | Comma-separated list of allowed crypto device names |

The accel_crypto_key_create RPC creates a crypto key and allows specifying the tweak mode using the --tweak-mode parameter. The default value is SIMPLE_LBA and corresponds to standard tweak mode. The value INCR_512_UPPER_LBA corresponds to OCI tweak mode. One key can be shared by multiple bdevs. Other parameters are standard and can be found in SPDK documentation.

Command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| cipher | Yes | Number | Crypto protocol (AES_XTS) |
| key | Yes | Number | Key |
| key2 | Yes | Number | Key2 |
| name | Yes | String | Key name |
| tweak-mode | No | String | Tweak mode. SIMPLE_LBA for standard tweak (default), INCR_512_UPPER_LBA for OCI tweak. |

The bdev_nvme_oci_attach_controller RPC creates an NVMe bdev. Optionally, the bdev can be encrypted by specifying crypto-key , which refers to a previously created crypto key.

Command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| crypto-key | No | String | Key name to use |

The following is an example of a configuration with an encrypted NVMe bdev and OCI tweak mode:

    [dpu] # spdk_rpc.py sock_set_default_impl -i xlio
    [dpu] # spdk_rpc.py mlx5_scan_accel_module --allowed-devs "mlx5_2" --split-mb-blocks 8
    [dpu] # spdk_rpc.py framework_start_init
    [dpu] # spdk_rpc.py accel_crypto_key_create -c AES_XTS -k 00112233445566778899001122334455 -e 11223344556677889900112233445500 -n test_dek --tweak-mode INCR_512_UPPER_LBA
    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme0 -t NVDA_TCP -a 1.1.1.1 -f ipv4 -s 4420 -n nqn.2016-06.io.spdk:cnode0 --crypto-key test_dek
    [dpu] # snap_rpc.py nvme_subsystem_create --nqn nqn.2023-05.io.nvda.nvme:0
    [dpu] # snap_rpc.py nvme_controller_create --nqn nqn.2023-05.io.nvda.nvme:0 --pf_id 0 --ctrl NVMeCtrl0 --suspended
    [dpu] # snap_rpc.py nvme_namespace_create --nqn nqn.2023-05.io.nvda.nvme:0 --bdev_name nvme0n1 --nsid 1 --uuid 263826ad-19a3-4feb-bc25-4bc81ee7749e
    [dpu] # snap_rpc.py nvme_controller_attach_ns --ctrl NVMeCtrl0 --nsid 1
    [dpu] # snap_rpc.py nvme_controller_resume --ctrl NVMeCtrl0

SPDK allows enabling NVMe TCP header and data digests for NVMe bdev. Data digest calculation and verification can be offloaded to hardware. Header digest is not offloaded and is always calculated in software.

The SPDK RPCs used in NVMe TCP digest offload configuration are the following:

| Command | Description |
|---|---|
| accel_set_options | Set accel framework's options |
| mlx5_scan_accel_module | Enables digest offload and merge of crypto and digest operations |
| bdev_nvme_oci_attach_controller | Creates NVMe bdev with optional NVMe TCP digest |

accel_set_options command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| task-count | No | Number | Maximum number of tasks per IO channel |

The maximum number of accel tasks per IO channel should be increased when data digest offload is enabled.

mlx5_scan_accel_module command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| enable-driver | No | Boolean | Allows merge of AES-XTS and data digest operations to optimize the processing |
| allowed-devs | No | String | Comma-separated list of allowed device names to do digest offload |

The enable-driver option should be enabled only when both digest and crypto are configured for the NVMe block device.

Tip For best performance, it is recommended to use merge in such configuration.





The bdev_nvme_oci_attach_controller RPC creates an NVMe bdev. It allows enabling NVMe TCP header and data digest.

Command parameters:

| Parameter | Mandatory? | Type | Description |
|---|---|---|---|
| hdgst | No | Boolean | Enables header digest. Not offloaded. |
| ddgst | No | Boolean | Enables data digest |

The following is an example of a configuration with an NVMe bdev with enabled header and data digest, and data digest offload enabled:

    [dpu] # spdk_rpc.py sock_set_default_impl -i xlio
    [dpu] # spdk_rpc.py accel_set_options --task-count 4096
    [dpu] # spdk_rpc.py mlx5_scan_accel_module --allowed-devs "mlx5_2" --split-mb-blocks 8
    [dpu] # spdk_rpc.py framework_start_init
    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme0 -t NVDA_TCP -a 1.1.1.1 -f ipv4 -s 4420 -n nqn.2016-06.io.spdk:cnode0 --hdgst --ddgst
    [dpu] # snap_rpc.py nvme_subsystem_create --nqn nqn.2023-05.io.nvda.nvme:0
    [dpu] # snap_rpc.py nvme_controller_create --nqn nqn.2023-05.io.nvda.nvme:0 --pf_id 0 --ctrl NVMeCtrl0 --suspended
    [dpu] # snap_rpc.py nvme_namespace_create --nqn nqn.2023-05.io.nvda.nvme:0 --bdev_name nvme0n1 --nsid 1 --uuid 263826ad-19a3-4feb-bc25-4bc81ee7749e
    [dpu] # snap_rpc.py nvme_controller_attach_ns --ctrl NVMeCtrl0 --nsid 1
    [dpu] # snap_rpc.py nvme_controller_resume --ctrl NVMeCtrl0

The SPDK version that comes with SNAP provides an extended implementation of NVMe multipath. It is referred to as "nested" multipath in the rest of this chapter.

The following terminology is used when working with nested multipath:

bdev_nvme controller – local representation of remote NVMe subsystem. The same namespace, accessible via different path groups and failover paths, is exposed as a single SPDK bdev.

Path group – alternative path to access the remote NVMe subsystem. Path groups are identified by the remote subsystem's NQN. All path groups are active at the same time. IOs are load-shared among path groups. Each path group establishes just one admin and one IO connection. The IO connection is bound to one SPDK core.

Failover path (or just path) – alternative path to access the remote NVMe subsystem. Failover paths belonging to the same path group have the same remote subsystem NQN but different IP addresses or ports. Only one failover path in the path group is active at any given moment. Inactive paths do not have established NVMe admin and IO queues. When the active failover path becomes unavailable, SPDK tries to connect to the next failover path and resubmit IOs.
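The relationship between these three terms can be pictured as a small data model. The following is a simplified sketch, not SPDK's internal representation: the class names and the wrap-around failover order are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PathGroup:
    """All failover paths that share one remote subsystem NQN."""
    nqn: str
    paths: list = field(default_factory=list)  # (address, port) tuples
    active: int = 0                            # index of the only connected path

    def fail_active(self):
        """Switch to the next failover path (wrapping around) and return it."""
        self.active = (self.active + 1) % len(self.paths)
        return self.paths[self.active]


@dataclass
class Controller:
    """A bdev_nvme controller: one bdev reachable through several path groups."""
    name: str
    groups: dict = field(default_factory=dict)  # NQN -> PathGroup

    def attach(self, nqn, address, port):
        # A new NQN creates a path group; a known NQN adds a failover path.
        self.groups.setdefault(nqn, PathGroup(nqn)).paths.append((address, port))

    def active_paths(self):
        """One active path per path group; all groups carry IO concurrently."""
        return {nqn: g.paths[g.active] for nqn, g in self.groups.items()}


ctrl = Controller("Nvme0")
ctrl.attach("nqn.2016-06.io.spdk:cnode0", "192.168.100.3", 4420)  # first path group
ctrl.attach("nqn.2016-06.io.spdk:cnode1", "192.168.100.3", 4420)  # second path group
ctrl.attach("nqn.2016-06.io.spdk:cnode0", "192.168.100.3", 4421)  # failover path
before = ctrl.active_paths()
ctrl.groups["nqn.2016-06.io.spdk:cnode0"].fail_active()
after = ctrl.active_paths()
print(before)
print(after)
```

Only the failed group changes its active path; the other path group keeps serving IO throughout.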

Nested multipath is a special mode of multipath operation in SPDK. It should be enabled globally with the bdev_nvme_oci_set_options RPC:

    [dpu] # spdk_rpc.py bdev_nvme_oci_set_options --transport-retry-count 4 --transport-ack-timeout 12 --ctrlr-loss-timeout-sec -1 --reconnect-delay-sec 10 --nested-mode

Connecting to the remote subsystem for the first time adds one path group with one path:

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b Nvme0 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode0 -f IPv4 -a 192.168.100.3 -s 4420

To add a path group to an existing bdev_nvme controller, use the same NVMe controller name ( -b parameter) but with a different subsystem NQN and with the -x multipath parameter:

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b Nvme0 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode1 -f IPv4 -a 192.168.100.3 -s 4420 -x multipath

To add a failover path to an existing path group, use the same NVMe controller name and subsystem NQN but with a different IP address or port and with the -x failover parameter:

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b Nvme0 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode0 -f IPv4 -a 192.168.100.3 -s 4421 -x failover

Set multipath policy to active-active:

    [dpu] # spdk_rpc.py bdev_nvme_oci_set_multipath_policy -b Nvme0n1 -p active_active -s round_robin -r 16

To create another bdev_nvme controller, use a new NVMe controller name:

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b Nvme1 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode2 -f IPv4 -a 192.168.100.3 -s 4420

To delete a failover path, use the bdev_nvme_oci_detach_controller RPC with the same parameters as in the corresponding bdev_nvme_oci_attach_controller :

    [dpu] # spdk_rpc.py bdev_nvme_oci_detach_controller -b Nvme0 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode0 -f IPv4 -a 192.168.100.3 -s 4421

To disable a path group, use the bdev_nvme_oci_disable_controller RPC with the NVMe controller's name and the subsystem NQN of the path group:

    [dpu] # spdk_rpc.py bdev_nvme_oci_disable_controller -n nqn.2016-06.io.spdk:cnode0 Nvme0

Note The disabled path group is not used for IOs, but its active path stays connected.

To check that all IOs are completed on the disabled path group, use the bdev_nvme_oci_check_controller_disabled RPC with the NVMe controller's name and the subsystem NQN of the path group. This RPC returns true when the disable procedure is completed.

    [dpu] # spdk_rpc.py bdev_nvme_oci_check_controller_disabled -n nqn.2016-06.io.spdk:cnode0 Nvme0

To enable a previously disabled path group, use the bdev_nvme_oci_enable_controller RPC with the NVMe controller's name and the subsystem NQN of the path group:

    [dpu] # spdk_rpc.py bdev_nvme_oci_enable_controller -n nqn.2016-06.io.spdk:cnode0 Nvme0

Upon enablement, one of the failover paths in the path group is connected and used for IOs.

To delete a path group, a special sequence must be followed to satisfy IO fencing requirements:

1. Disable the path group with the bdev_nvme_oci_disable_controller RPC to stop new IO submission to the path group.

2. Wait until all pending IOs are complete by periodically checking the path group disable status. The bdev_nvme_oci_check_controller_disabled RPC returns true if all pending IOs are completed.

3. Remove the path group by removing all paths for this path group (i.e., paths with the same bdev_nvme controller name and NQN).

    [dpu] # spdk_rpc.py bdev_nvme_oci_disable_controller -n nqn.2016-06.io.spdk:cnode0 Nvme0
    [dpu] # spdk_rpc.py bdev_nvme_oci_check_controller_disabled -n nqn.2016-06.io.spdk:cnode0 Nvme0
    [dpu] # spdk_rpc.py bdev_nvme_oci_detach_controller -b Nvme0 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode0 -f IPv4 -a 192.168.100.3 -s 4421

Step 2 may never finish because the bdev_nvme_oci_check_controller_disabled RPC does not abort any IO or close any connection forcefully. The control plane should have a timeout for this case. If it times out, the options are as follows:

Re-enable the path group using the bdev_nvme_oci_enable_controller RPC and add a new healthy path using the bdev_nvme_oci_attach_controller RPC

Forcefully remove the path group using the bdev_nvme_oci_detach_controller RPC

Info Multiple path groups can be removed in parallel using this sequence.
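A control plane could implement the disable, wait, and remove sequence with a timeout along the following lines. This is a minimal sketch: `run_rpc` is a hypothetical callable that sends one spdk_rpc.py command (given as an argument list) and returns its parsed result, and the timeout and poll interval are arbitrary example values.

```python
import time


def delete_path_group(run_rpc, ctrl_name, nqn, paths,
                      timeout_s=30.0, poll_s=0.5,
                      sleep=time.sleep, clock=time.monotonic):
    """Return True if the path group was removed, False if the wait timed out."""
    # Step 1: stop new IO submission to the path group.
    run_rpc(["bdev_nvme_oci_disable_controller", "-n", nqn, ctrl_name])

    # Step 2: poll until pending IOs complete, bounded by a deadline.
    deadline = clock() + timeout_s
    while not run_rpc(["bdev_nvme_oci_check_controller_disabled", "-n", nqn, ctrl_name]):
        if clock() >= deadline:
            return False  # caller decides: re-enable the group or detach forcefully
        sleep(poll_s)

    # Step 3: remove every path that belongs to this path group.
    for address, port in paths:
        run_rpc(["bdev_nvme_oci_detach_controller", "-b", ctrl_name, "-t", "NVDA_TCP",
                 "-n", nqn, "-f", "IPv4", "-a", address, "-s", str(port)])
    return True


# Demonstration with a fake RPC layer: the check reports "not done" twice, then "done".
calls = []
check_results = [False, False, True]


def fake_rpc(argv):
    calls.append(argv[0])
    if argv[0] == "bdev_nvme_oci_check_controller_disabled":
        return check_results.pop(0)
    return True


ok = delete_path_group(fake_rpc, "Nvme0", "nqn.2016-06.io.spdk:cnode0",
                       [("192.168.100.3", 4421)], sleep=lambda s: None)
print(ok, calls)
```

Returning False on timeout, rather than detaching automatically, leaves the choice between the two recovery options to the caller.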

To delete a bdev_nvme controller, remove all path groups (i.e., path groups with the same bdev_nvme controller name) following the procedure for path group delete.

The bdev group allows some functionality to be applied to a set of bdevs rather than a single bdev. The bdev group QoS is one of such functionalities. Each bdev can only be added to one group at a time.

The SPDK bdev QoS is a rate-limiting mechanism that controls the throughput used by a bdev by placing strict limits on the number of IOs or the amount of data it is allowed to handle per second. The bdev group QoS applies the same limits to a group of bdevs.

The available limitations for both bdevs and bdev groups are as follows:

rw_ios_per_sec – read/write IOs per second

rw_mbytes_per_sec – read/write MiB per second

r_mbytes_per_sec – read MiB per second

w_mbytes_per_sec – write MiB per second

More than one limitation can be applied simultaneously.

Note For each bdev, only one of either bdev QoS or bdev group QoS is supported. If bdev group QoS is enabled, bdev QoS should not be enabled for bdevs in the bdev group. This restriction may be removed in future.

Note When using SPDK bdev QoS, direct I/O (non-buffered I/O) should be used. For Linux host, this is the O_DIRECT flag. SPDK bdev QoS cannot control how hosts are using their buffer cache for buffered I/O.

Note IOPS limits are rounded up internally to the nearest multiple of 1000. For example, 2001 and 2999 are both rounded to 3000. Bandwidth limits, on the other hand, are rounded up internally to the nearest multiple of 1 MiB (1024^2 bytes).
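The rounding rule can be checked with a few lines of arithmetic. The following is a minimal sketch of round-up-to-multiple; the function names are hypothetical and the bandwidth value is expressed in bytes per second.

```python
MIB = 1024 ** 2


def round_up(value, multiple):
    """Round value up to the nearest multiple using integer ceiling division."""
    return -(-value // multiple) * multiple


def effective_iops_limit(iops):
    return round_up(iops, 1000)


def effective_bandwidth_limit(bytes_per_sec):
    return round_up(bytes_per_sec, MIB)


print(effective_iops_limit(2001))            # 3000
print(effective_iops_limit(2999))            # 3000
print(effective_bandwidth_limit(MIB + 1))    # 2097152, i.e. 2 MiB
```

Values that are already exact multiples are left unchanged.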

The SPDK bdev QoS adopts a hybrid pool design combining a local cache and a global pool. The global pool has a quota for each one-millisecond interval. IO does not consume the quota of the global pool directly. Instead, each core caches quota, and IO consumes the quota of the local cache. When a local cache has fully consumed its quota, it acquires a slice of quota from the global pool. The slice size is configurable through the qos_io_slice and qos_byte_slice parameters of the bdev_set_options RPC.
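The interaction between the global pool and the per-core caches can be modeled in a few lines. The following is a simplified sketch of the idea for a single one-millisecond window, counting IOs only; the class names and quota values are hypothetical and this is not SPDK's implementation.

```python
class GlobalPool:
    """Quota shared by all cores for the current window."""

    def __init__(self, quota):
        self.remaining = quota

    def take(self, slice_size):
        """Hand out up to slice_size units of quota."""
        granted = min(slice_size, self.remaining)
        self.remaining -= granted
        return granted


class CoreCache:
    """Per-core quota cache that refills from the global pool one slice at a time."""

    def __init__(self, pool, slice_size):
        self.pool = pool
        self.slice_size = slice_size
        self.cached = 0

    def submit_io(self):
        """Return True if the IO may proceed in this window."""
        if self.cached == 0:
            self.cached = self.pool.take(self.slice_size)
        if self.cached == 0:
            return False  # global quota exhausted: IO waits for the next window
        self.cached -= 1
        return True


pool = GlobalPool(quota=250)
core0 = CoreCache(pool, slice_size=100)
core1 = CoreCache(pool, slice_size=100)

admitted0 = sum(core0.submit_io() for _ in range(150))
admitted1 = sum(core1.submit_io() for _ in range(100))
print(admitted0, admitted1, pool.remaining, core0.cached)
```

In this run core 0 still holds unused cached quota while core 1 is throttled, which shows the trade-off behind the slice size: larger slices mean fewer trips to the global pool but coarser sharing between cores.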

Create bdev group:

    [dpu] # spdk_rpc.py bdev_group_create Group0

Delete group:

    [dpu] # spdk_rpc.py bdev_group_delete Group0

Add bdev to group:

    [dpu] # spdk_rpc.py bdev_group_add_bdev Group0 Nvme0n1

Remove bdev from group:

    [dpu] # spdk_rpc.py bdev_group_remove_bdev Group0 Nvme0n1

Set bdev group QoS limit:

    [dpu] # spdk_rpc.py bdev_group_set_qos_limit --rw-ios-per-sec 1000000 --rw-mbytes-per-sec 1000000 --r-mbytes-per-sec 1000000 --w-mbytes-per-sec 1000000 Group0

Set individual bdev QoS limit:

    [dpu] # spdk_rpc.py bdev_set_qos_limit --rw-ios-per-sec 1000000 --rw-mbytes-per-sec 1000000 --r-mbytes-per-sec 1000000 --w-mbytes-per-sec 1000000 Nvme0n1

Get bdev groups info:

    [dpu] # spdk_rpc.py bdev_groups_get

Set the slice of quota to transfer from the global pool to a local cache:

    [dpu] # spdk_rpc.py bdev_set_options --qos-io-slice 100 --qos-byte-slice 4096

To utilize network QoS for storage, a VLAN tag must be added to each NVMe-oF controller using the vlan_tag parameter ( --vlan-tag on the command line) of the bdev_nvme_oci_attach_controller RPC.

Network QoS is expected to be used per SR-IOV VF. Hence, the same VLAN tag should be used for all NVMe-oF controllers within a single SR-IOV VF. The following is an example of the RPC usage:

    [dpu] # spdk_rpc.py bdev_nvme_oci_attach_controller -b Nvme0 -t NVDA_TCP -n nqn.2016-06.io.spdk:cnode0 -f IPv4 -a 192.168.100.3 -s 4420 --vlan-tag 123





The following reservation commands are supported for both NVMe and crypto bdevs:

Reservation report command (NVMe and crypto bdev)

Reservation acquire command (NVMe and crypto bdev)

Reservation release command (NVMe and crypto bdev)

Reservation register command (NVMe and crypto bdev)

Get RESCAP capabilities (NVMe and crypto bdev)

Indicate support for reservations by returning a 1 in bit 5 of the "Optional NVM Command Support" (ONCS) field in the Identify Controller data structure.

The reservation commands are limited by the following:

There is no API to set host identifier

There is no API to set reservation notification mask

Reservation Log Page Async Events are not supported (i.e., host would not receive async events related to reservation)

Get Reservation Log Page is not supported (i.e., host requests to get the reservation log page are not supported)

The maximum number of IO commands may be limited from within the DPA multiplexer (MP) provider. This would apply to the maximum number of IO commands in-flight for a namespace per weight unit.

Note Limiting the maximum in-flight on one controller namespace may impact other namespaces on the same controller.

DPA MP is a namespace-centric polling system. For instance, users may restrict the maximum in-flight commands to 100 in namespace 1.

    [dpu] # snap_rpc.py nvme_namespace_modify --nqn nqn.2023-05.io.nvda.nvme:0 --nsid 1 --max_inflights 100





1. Update the firmware configuration as described in section "OCI Firmware Configuration Example" and enable encryption as described in section "OCI DPU Configurations". Reboot the host.

2. Configure the IP and MTU as described in section "OCI DPU Configurations".

3. Update hugepage allocation as described in section "Step 1: Allocate Hugepages".

4. Configure SPDK and XLIO:

    [dpu] # spdk_rpc.py sock_set_default_impl -i xlio
    [dpu] # spdk_rpc.py sock_impl_set_options -i xlio -r 16777216
    [dpu] # spdk_rpc.py accel_set_options --task-count 4096
    [dpu] # spdk_rpc.py mlx5_scan_accel_module --allowed-devs "mlx5_2" --split-mb-blocks 8 --enable-driver
    [dpu] # spdk_rpc.py bdev_set_options --disable-auto-examine
    [dpu] # spdk_rpc.py framework_start_init
    [dpu] # spdk_rpc.py bdev_nvme_oci_set_options --transport-retry-count 4 --transport-ack-timeout 12 --ctrlr-loss-timeout-sec -1 --reconnect-delay-sec 10
    [dpu] # spdk_rpc.py accel_crypto_key_create -c AES_XTS -k 00112233445566778899001122334455 -e 11223344556677889900112233445500 -n test_dek --tweak-mode INCR_512_UPPER_LBA

5. Create a dummy controller on the parent PF:

    [dpu] # snap_rpc.py nvme_subsystem_create --nqn nqn.2023-05.io.nvda.nvme:0
    [dpu] # snap_rpc.py nvme_controller_create --nqn nqn.2023-05.io.nvda.nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --admin_only

6. Create 192 encrypted NVMe bdevs. For example:

    [dpu] # for i in `seq 0 191`; do
        spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme$((i+1)) -t nvda_tcp -a 1.1.3.90 -f ipv4 -s 4432 -n nqn.2023-05.io.nvme:nvme$((i+1)) --fabrics-timeout 1000000 --lazy-conn "nsid:$((i+1)) blocklen:512 blockcnt:131072" --crypto-key test_dek --hdgst --ddgst;
    done

Note SPDK multipath is supported. To create multipath bdevs, add the argument --nested-mode to the bdev_nvme_oci_set_options RPC in step 4 and run the following instead of step 6:

    [dpu] # for i in `seq 0 191`; do
        spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme$((i+1)) -t nvda_tcp -a 1.1.3.90 -f ipv4 -s 4432 -n nqn.2023-05.io.nvme:nvme$((i+1))_pg0 --fabrics-timeout 1000000 --lazy-conn "nsid:1 blocklen:512 blockcnt:131072" --crypto-key test_dek --hdgst --ddgst;
        spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme$((i+1)) -t nvda_tcp -a 1.1.3.90 -f ipv4 -s 4432 -n nqn.2023-05.io.nvme:nvme$((i+1))_pg1 -x multipath --fabrics-timeout 1000000 --hdgst --ddgst;
        spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme$((i+1)) -t nvda_tcp -a 1.1.3.91 -f ipv4 -s 4432 -n nqn.2023-05.io.nvme:nvme$((i+1))_pg0 -x failover --fabrics-timeout 1000000 --hdgst --ddgst;
        spdk_rpc.py bdev_nvme_oci_attach_controller -b nvme$((i+1)) -t nvda_tcp -a 1.1.3.91 -f ipv4 -s 4432 -n nqn.2023-05.io.nvme:nvme$((i+1))_pg1 -x failover --fabrics-timeout 1000000 --hdgst --ddgst;
        spdk_rpc.py bdev_nvme_oci_set_multipath_policy -b nvme$((i+1))n1 -p active_active -s round_robin -r 16;
    done

Note SPDK QoS is supported. To configure QoS on VFs and individual SPDK block devices:

    [dpu] # for i in `seq 0 191`; do
        spdk_rpc.py bdev_group_create Group$((i+1));
        spdk_rpc.py bdev_group_add_bdev Group$((i+1)) nvme$((i+1))n1;
        spdk_rpc.py bdev_group_set_qos_limit --rw-ios-per-sec 300000 Group$((i+1));
        spdk_rpc.py bdev_set_qos_limit --rw-ios-per-sec 150000 nvme$((i+1))n1;
    done

7. Create 192 namespaces and 192 controllers:

    [dpu] # for i in `seq 0 191`; do
        snap_rpc.py nvme_subsystem_create --nqn nqn.2023-05.io.nvda.nvme:VF${i};
        snap_rpc.py nvme_namespace_create -b nvme$((i+1))n$((i+1)) -n 1 --nqn nqn.2023-05.io.nvda.nvme:VF${i} --uuid 3d9c3b54-5c31-410a-b4f0-7cf2afd9e48$((i+100));
        snap_rpc.py nvme_controller_create --nqn nqn.2023-05.io.nvda.nvme:VF${i} --ctrl NVMeCtrl$((i+2)) --pf_id 0 --vf_id $i --suspended;
        snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl$((i+2)) -n 1;
        snap_rpc.py nvme_controller_resume -c NVMeCtrl$((i+2));
    done

Note In OCI's use case, NSID range is restricted to 1-32.

8. Load the driver and configure VFs:

    [host] # modprobe -v nvme
    [host] # echo 192 > /sys/bus/pci/devices/0000\:25\:00.2/sriov_numvfs

The live update flow that supports containers attached to lazy bdevs deletes the bdev in the source container before creating the controller in the destination container.

The following example live upgrades physical function 0 and one virtual function.

Note To modify the example to support SR-IOV, refer to section "SNAP Container Live Upgrade".

Note The SNAP Live Update tool does not support OCI's use case.

    crictl exec -it $(crictl ps -s running -q --name snap-src) spdk_rpc.py bdev_nvme_attach_controller -b Nullb0 -t rdma -a 1.1.1.91 -f ipv4 -s 4432 -n nqn.2016-06.io.nvmet.stor03:null0
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --admin_only
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_subsystem_create --nqn nqn.2022-10.io.nvda.nvme:0
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:0 --ctrl NVMeCtrl2 --pf_id 0 --vf_id 0 --suspended --num_queues 1
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_namespace_create -b Nullb0n1 -n 1 --nqn nqn.2022-10.io.nvda.nvme:0 --uuid 16dab065-ddc9-8a7a-108e-9a489254a342
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl2 -n 1
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_resume -c NVMeCtrl2
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_destroy -c NVMeCtrl1
    crictl exec -it $(crictl ps -s running -q --name snap-dst) snap_rpc.py nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --admin_only
    crictl exec -it $(crictl ps -s running -q --name snap-dst) spdk_rpc.py bdev_nvme_attach_controller -b nvme0 -t rdma -a 192.168.170.142 -f ipv4 -s 4420 -n nqn.2016-06.io.nvmet.r-nvmx02-057:snap-ver-02-0
    crictl exec -it $(crictl ps -s running -q --name snap-dst) snap_rpc.py nvme_subsystem_create --nqn nqn.2022-10.io.nvda.nvme:0
    crictl exec -it $(crictl ps -s running -q --name snap-dst) snap_rpc.py nvme_namespace_create -b nvme0n1 -n 1 --nqn nqn.2022-10.io.nvda.nvme:0 --uuid 16dab065-ddc9-8a7a-108e-9a489254a342
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_suspend -c NVMeCtrl2 --admin_only
    crictl exec -it $(crictl ps -s running -q --name snap-dst) snap_rpc.py nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:0 --ctrl NVMeCtrl2 --pf_id 0 --suspended --live_update_listener --num_queues 1
    crictl exec -it $(crictl ps -s running -q --name snap-dst) snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl2 -n 1
    crictl exec -it $(crictl ps -s running -q --name snap-src) snap_rpc.py nvme_controller_suspend -c NVMeCtrl2 --live_update_notifier

Plugins are modular components or add-ons that enhance the functionality of the SNAP application. They integrate seamlessly with the main software, allowing additional features without requiring changes to the core codebase. Plugins are designed for use only with the source package, as it allows customization during the build process, such as enabling or disabling plugins as needed.

In containerized environments, the SNAP application is shipped as a pre-built binary with a fixed configuration. Since the binary in the container is precompiled, adding or removing plugins is not possible. The containerized software only supports the plugins included during its build. For environments requiring plugin flexibility, such as adding custom plugins, the source package must be used.

To build a SNAP source package with a plugin, perform the following instead of following the basic build steps:

1. Move to the sources folder. Run:

    cd /opt/nvidia/nvda_snap/src/

2. Build the sources with the plugins to be enabled. Run:

    meson setup /tmp/build -Denable-bdev-null=true -Denable-bdev-malloc=true

3. Compile the sources. Run:

    meson compile -C /tmp/build

4. Install the sources. Run:

    meson install -C /tmp/build

5. Configure the SNAP environment variables and run SNAP service as explained in sections "Configure SNAP Environment Variables" and "Run SNAP Service".

SNAP supports various types of block devices (bdev), offering flexibility and extensibility in interacting with storage backends. These bdev plugins provide different storage emulation options, allowing customization without requiring modifications to the core software.

SPDK is the default plugin used by SNAP. If no specific plugin is explicitly specified, SNAP will default to using SPDK for block device operations.

For more information, refer to spdk_bdev.

The Malloc plugin is intended for performance analysis and debugging purposes only; it is not suitable for production use

It creates a memory-backed block device by allocating a buffer in memory and exposing it as a block device

Since data is stored in memory, it is lost when the system shuts down

This plugin can be enabled using the enable-bdev-malloc build option

Malloc configuration example:

Create Malloc bdev and use it with an NVMe controller:

    # snap_rpc.py snap_bdev_malloc_create --bdev test 64 512
    # snap_rpc.py nvme_subsystem_create -s nqn.2020-12.mlnx.snap
    # snap_rpc.py nvme_namespace_create -s nqn.2020-12.mlnx.snap -t malloc -b test -n 1
    # snap_rpc.py nvme_controller_create --pf_id=0 -s nqn.2020-12.mlnx.snap --mdts=7
    # snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl1 -n 1

Delete Malloc bdev:

    # snap_rpc.py snap_bdev_malloc_destroy test

Resize Malloc bdev:

    # snap_rpc.py snap_bdev_malloc_resize test 32

This removes the existing bdev and creates a new one with the specified size. Data on the existing bdev will be lost.

The NULL plugin is designed for performance analysis and debugging purposes and is not intended for production use.

It acts as a dummy block device, accepting I/O requests and emulating a block device without performing actual I/O operations.

It is useful for testing or benchmarking scenarios that do not involve real storage devices.

The plugin consumes minimal system resources.

It can be enabled using the enable-bdev-null build option.

NULL configuration example: