SNAP-4 Service Advanced Features
RPC log history (enabled by default) records all RPC requests (from snap_rpc.py and spdk_rpc.py) sent to the SNAP application, along with the RPC response for each request, in a dedicated log file, /var/log/snap-log/rpc-log. This file is visible outside the container (i.e., the log file's path on the DPU is /var/log/snap-log/rpc-log as well).
The SNAP_RPC_LOG_ENABLE environment variable can be used to enable (1) or disable (0) this feature.
RPC log history is supported with SPDK version spdk23.01.2-12 and above.
When RPC log history is enabled, the SNAP application constantly writes (in append mode) RPC request and response messages to /var/log/snap-log/rpc-log. Pay attention to the size of this file. If it grows too large, delete it on the DPU before launching the SNAP pod.
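For example, the feature can be toggled from the container YAML using the same env list layout used elsewhere in this document (a minimal sketch):
env:
  - name: SNAP_RPC_LOG_ENABLE
    value: "1"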
SR-IOV configuration depends on the kernel version and must be handled carefully to ensure device visibility and system stability across both hypervisor and DPU orchestrators.
To ensure a safe and stable SR-IOV setup, follow these steps:
Preconfigure VF controllers on the DPU – Before configuring SR-IOV on the host, ensure that the DPU is properly configured with all required VF controllers already created and opened.
VF functions are always visible and configurable on the DPU side. Use the following command to verify:
snap_rpc.py emulation_function_list --all
Confirm that the configuration meets your requirements.
Check that the number of resources allocated for the PF, specifically MSI-X vectors and queues (queried via the snap_rpc.py virtio_blk_controller_list RPC; see the free_queues and free_msix fields in its output), is enough to satisfy the needs of all underlying VFs. Use dynamic MSI-X if needed and supported.
Once host-side configuration begins, further modifications may not be possible.
Disable autoprobing with sriov_drivers_autoprobe=0 – In deployments with many virtual devices, autoprobing must be disabled to ensure stable device discovery. Failing to disable autoprobing may result in:
Incomplete device visibility
Missing virtual disks
System hangs during initialization
Unreliable behavior in large-scale environments (more than 100 VFs)
Tip: Recommended configuration for large-scale deployments:
Disable autoprobe:
echo 0 > /sys/bus/pci/devices/<BDF>/sriov_drivers_autoprobe
Manually bind the VFs to drivers using tools such as driverctl, or by writing to bind/unbind in sysfs.
Configure SR-IOV on the host – For small-scale deployments (fewer than 100 VFs), use the sriov_totalvfs entry:
echo <number_of_vfs> > /sys/bus/pci/devices/<BDF>/sriov_totalvfs
For newer drivers, use:
echo <number_of_vfs> > /sys/bus/pci/devices/<BDF>/sriov_numvfs
Note: After SR-IOV configuration, no disks appear in the hypervisor by default. Disks are only visible inside VMs once the corresponding PCIe VF is assigned to the VM via a virtualization manager (e.g., libvirt, VMware). To use the device directly from the hypervisor, manually bind the VF to the appropriate driver.
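For example, to use a VF directly from the hypervisor after SR-IOV is enabled, the VF can be bound manually through sysfs. The following is a minimal sketch; the BDF values and driver name are placeholders for your setup:
# Enable 2 VFs on the PF (with autoprobe disabled, as recommended above)
echo 2 > /sys/bus/pci/devices/0000:84:00.0/sriov_numvfs
# Manually bind a VF to the appropriate driver (e.g., nvme or virtio-pci, depending on the emulated device)
echo "0000:84:00.2" > /sys/bus/pci/drivers/nvme/bind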
Additional notes:
Hot-plugged PFs do not support SR-IOV.
For deployments requiring more than 127 VFs, add the following kernel parameter to the host’s boot command line:
pci=assign-busses
Without this, the host may log errors such as:
pci 0000:84:00.0: [1af4:1041] type 7f class 0xffffff
pci 0000:84:00.0: unknown header type 7f, ignoring device
These errors prevent the virtio driver from probing the device.
Zero-copy is supported on SPDK 21.07 and higher.
SNAP-direct allows SNAP applications to transfer data directly from the host memory to remote storage without using any staging buffer inside the DPU.
SNAP enables the feature according to the SPDK BDEV configuration only when working against an SPDK NVMe-oF RDMA block device.
To enable zero copy, set the following environment variable (it is enabled by default):
SNAP_RDMA_ZCOPY_ENABLE=1
For more info refer to SNAP-4 Service Environment Variables.
NVMe/TCP Zero Copy is implemented as a custom NVDA_TCP transport in SPDK NVMe initiator, and it is based on a new XLIO socket layer implementation.
The implementation is different for Tx and Rx:
The NVMe/TCP Tx Zero Copy is similar between RDMA and TCP in that the data is sent from the host memory directly to the wire without an intermediate copy to Arm memory
The NVMe/TCP Rx Zero Copy allows achieving partial zero copy on the Rx flow by eliminating copy from socket buffers (XLIO) to application buffers (SNAP). But data still must be DMA'ed from Arm to host memory.
To enable NVMe/TCP Zero Copy, use SPDK v22.05.nvda (or higher) compiled with --with-xlio.
For more information about XLIO including limitations and bug fixes, refer to the NVIDIA Accelerated IO (XLIO) Documentation.
To enable SNAP TCP XLIO Zero Copy:
SNAP container: Set the environment variables and resources in the YAML file to request 6G of hugepages:
resources:
  requests:
    memory: "4Gi"
    cpu: "8"
  limits:
    hugepages-2Mi: "6Gi"
    memory: "6Gi"
    cpu: "16" ## Set according to the local setup
env:
  - name: APP_ARGS
    value: "--wait-for-rpc"
  - name: SPDK_XLIO_PATH
    value: "/usr/lib/libxlio.so"
SNAP sources: Set the environment variables and resources in the relevant scripts
In run_snap.sh, edit the APP_ARGS variable to use the SPDK command line argument --wait-for-rpc:
APP_ARGS="--wait-for-rpc"
In set_environment_variables.sh, uncomment the SPDK_XLIO_PATH environment variable:
export SPDK_XLIO_PATH="/usr/lib/libxlio.so"
NVMe/TCP XLIO requires a BlueField Arm OS hugepage size of 4Gi. For information on configuring the hugepages, refer to sections "Step 1: Allocate Hugepages" and "Adjusting YAML Configuration".
At high scale, it is required to use the global variable XLIO_RX_BUFS=4096 even though it leads to high memory consumption. Using XLIO_RX_BUFS=1024 lowers memory consumption but limits the ability to scale the workload.
For more info refer to SNAP-4 Service Environment Variables.
It is recommended to configure NVMe/TCP XLIO with the transport ack timeout option increased to 12.
[dpu] spdk_rpc.py bdev_nvme_set_options --transport-ack-timeout 12
Other bdev_nvme options may be adjusted according to requirements.
Expose an NVMe-oF subsystem with one namespace by using a TCP transport type on the remote SPDK target.
[dpu] spdk_rpc.py sock_set_default_impl -i xlio
[dpu] spdk_rpc.py framework_start_init
[dpu] spdk_rpc.py bdev_nvme_set_options --transport-ack-timeout 12
[dpu] spdk_rpc.py bdev_nvme_attach_controller -b nvme0 -t nvda_tcp -a 3.3.3.3 -f ipv4 -s 4420 -n nqn.2023-01.io.nvmet
[dpu] snap_rpc.py nvme_subsystem_create --nqn nqn.2023-01.com.nvda:nvme:0
[dpu] snap_rpc.py nvme_namespace_create -b nvme0n1 -n 1 --nqn nqn.2023-01.com.nvda:nvme:0 --uuid 16dab065-ddc9-8a7a-108e-9a489254a839
[dpu] snap_rpc.py nvme_controller_create --nqn nqn.2023-01.com.nvda:nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --suspended --num_queues 16
[dpu] snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl1 -n 1
[dpu] snap_rpc.py nvme_controller_resume -c NVMeCtrl1 -n 1
[host] modprobe -v nvme
[host] fio --filename /dev/nvme0n1 --rw randrw --name=test-randrw --ioengine=libaio --iodepth=64 --bs=4k --direct=1 --numjobs=1 --runtime=63 --time_based --group_reporting --verify=md5
For more information on XLIO, please refer to XLIO documentation.
The SPDK version included with SNAP supports hardware encryption/decryption offload. To enable AES/XTS and allow the mlx5_2 and mlx5_3 SFs to support encryption, they must be designated as trusted.
Edit the configuration file /etc/mellanox/mlnx-sf.conf. Append the following commands to configure the VHCA_TRUST_LEVEL and create the SFs:
/usr/bin/mlxreg -d 03:00.0 --reg_name VHCA_TRUST_LEVEL --yes --indexes "vhca_id=0x0,all_vhca=0x1" --set "trust_level=0x1"
/usr/bin/mlxreg -d 03:00.1 --reg_name VHCA_TRUST_LEVEL --yes --indexes "vhca_id=0x0,all_vhca=0x1" --set "trust_level=0x1"
/sbin/mlnx-sf --action create --device 0000:03:00.0 --sfnum 0 --hwaddr 02:11:3c:13:ad:82
/sbin/mlnx-sf --action create --device 0000:03:00.1 --sfnum 0 --hwaddr 02:76:78:b9:6f:52
Reboot the DPU to apply these changes.
Zero Copy (SNAP-direct) with Encryption
SNAP offers support for zero copy with encryption for bdev_nvme with an RDMA transport.
If another bdev_nvme transport or base bdev other than NVMe is used, then zero copy flow is not supported, and additional DMA operations from the host to the BlueField Arm are performed.
Refer to section "SPDK Crypto Example" to see how to configure zero copy flow with AES_XTS offload.
Command | Description |
mlx5_scan_accel_module | Accepts a list of devices to be used for the crypto operation |
accel_crypto_key_create | Creates a crypto key |
bdev_nvme_attach_controller | Constructs NVMe block device |
bdev_crypto_create | Creates a virtual block device which encrypts write I/O commands and decrypts read I/O commands |
mlx5_scan_accel_module
Accepts a list of devices to use for the crypto operation provided in the --allowed-devs parameter. If no devices are specified, then the first device which supports encryption is used.
For best performance, it is recommended to use the devices with the largest InfiniBand MTU (4096). The MTU size can be verified using the ibv_devinfo command (look for the max and active MTU fields). Normally, the mlx5_2 device is expected to have an MTU of 4096 and should be used as an allowed crypto device.
Command parameters:
Parameter | Mandatory? | Type | Description |
qp-size | No | Number | QP size |
num-requests | No | Number | Size of the shared requests pool |
allowed-devs | No | String | Comma-separated list of allowed device names (e.g., "mlx5_2"). Note: Make sure that the device used for RDMA traffic is selected to support zero copy. |
enable-driver | No | Boolean | Enables the accel_mlx5 platform driver. Allows AES_XTS RDMA zero copy. |
accel_crypto_key_create
Creates crypto key. One key can be shared by multiple bdevs.
Command parameters:
Parameter | Mandatory? | Type | Description |
cipher | Yes | String | Crypto protocol (AES_XTS) |
key | Yes | String | Key |
key2 | Yes | String | Key2 |
name | Yes | String | Key name |
bdev_nvme_attach_controller
Creates NVMe block device.
Command parameters:
Parameter | Mandatory? | Type | Description |
name | Yes | String | Name of the NVMe controller, prefix for each bdev name |
trtype | Yes | String | NVMe-oF target trtype (e.g., rdma, pcie) |
traddr | Yes | String | NVMe-oF target address (e.g., an IP address or BDF) |
trsvcid | No | String | NVMe-oF target trsvcid (e.g., a port number) |
adrfam | No | String | NVMe-oF target adrfam (e.g., ipv4, ipv6) |
subnqn | No | String | NVMe-oF target subnqn |
bdev_crypto_create
This RPC creates a virtual crypto block device which adds encryption to the base block device.
Command parameters:
Parameter | Mandatory? | Type | Description |
base_bdev_name | Yes | String | Name of the base bdev |
name | Yes | String | Crypto bdev name |
key_name | Yes | String | Name of the crypto key created with accel_crypto_key_create |
SPDK Crypto Example
The following is an example of a configuration with a crypto virtual block device created on top of bdev_nvme with RDMA transport and zero copy support:
[dpu] # spdk_rpc.py mlx5_scan_accel_module --allowed-devs "mlx5_2" --enable-driver
[dpu] # spdk_rpc.py framework_start_init
[dpu] # spdk_rpc.py accel_crypto_key_create -c AES_XTS -k 00112233445566778899001122334455 -e 11223344556677889900112233445500 -n test_dek
[dpu] # spdk_rpc.py bdev_nvme_attach_controller -b nvme0 -t rdma -a 1.1.1.1 -f ipv4 -s 4420 -n nqn.2016-06.io.spdk:cnode0
[dpu] # spdk_rpc.py bdev_crypto_create nvme0n1 crypto_0 -n test_dek
[dpu] # snap_rpc.py nvme_subsystem_create --nqn nqn.2023-05.io.nvda.nvme:0
[dpu] # snap_rpc.py nvme_controller_create --nqn nqn.2023-05.io.nvda.nvme:0 --pf_id 0 --ctrl NVMeCtrl0 --suspended
[dpu] # snap_rpc.py nvme_namespace_create --nqn nqn.2023-05.io.nvda.nvme:0 --bdev_name crypto_0 --nsid 1 --uuid 263826ad-19a3-4feb-bc25-4bc81ee7749e
[dpu] # snap_rpc.py nvme_controller_attach_ns --ctrl NVMeCtrl0 --nsid 1
[dpu] # snap_rpc.py nvme_controller_resume --ctrl NVMeCtrl0
Live migration is a standard process supported by QEMU which allows system administrators to pass devices between virtual machines in a live running system. For more information, refer to QEMU VFIO Device Migration documentation.
Live migration is supported for SNAP virtio-blk devices in legacy and standard VFIO modes. Legacy mode uses drivers like NVIDIA's proprietary vDPA-based Live Migration Solution, while standard mode leverages the latest kernel capabilities using the virtio-vfio-pci kernel driver. Legacy mode can be enabled/disabled using the environment variable `VIRTIO_CTRL_VDPA_ADMIN_Q` (enabled by default).
In the standard virtio live migration process, the device is expected to complete all inflight I/Os, with no configurable timeout. If the remote storage is unavailable (disconnected or non-responsive), the device migration will wait indefinitely. This means migration time cannot be guaranteed, representing a degradation compared to the functionality of legacy mode.
Software Requirements for Standard VFIO
Kernel – 6.16-rc3+ (using the virtio-vfio-pci driver)
QEMU – 9.2+
libvirt – 10.6+
SNAP Configuration
Set the environment variable VIRTIO_CTRL_VDPA_ADMIN_Q to 1 (default) for legacy mode or 0 for standard VFIO mode.
Create a PF controller with an admin queue (common to both modes):
snap_rpc.py virtio_blk_controller_create --admin_q …
Live upgrade enables updating the SNAP image used by a container without causing SNAP container downtime.
While newer SNAP releases may introduce additional content, potentially causing behavioral differences during the upgrade, the process is designed to ensure backward compatibility. Updates between releases within the same sub-version (e.g., 4.0.0-x to 4.0.0-y) should proceed without issues.
However, updates across different major or minor versions may require changes to system components (e.g., firmware, BFB), which may impact backward compatibility and necessitate a full reboot post update. In those cases, live updates are unnecessary.
Live Upgrade Prerequisites
To enable live upgrade, perform the following modifications:
Allocate double hugepages for the destination and source containers.
Make sure the requested amount of CPU cores is available.
The default YAML configuration sets the container to request a CPU core range of 8-16. This means that the container is not deployed if there are fewer than 8 available cores, and if there are 16 free cores, the container utilizes all 16.
For instance, if a container is currently using all 16 cores and an additional SNAP container is deployed during a live upgrade, each container uses 8 cores during the upgrade process. Once the source container is terminated, the destination container starts utilizing all 16 cores.
Note: For 8-core DPUs, the .yaml must be edited to a range of 4-8 CPU cores.
Change the name of the doca_snap.yaml file that describes the destination container (e.g., doca_snap_new.yaml) so as not to overwrite the running container's .yaml.
Change the name of the new .yaml pod and container on lines 16 and 20, respectively (e.g., snap-new).
Deploy the destination container by copying the new .yaml (e.g., doca_snap_new.yaml) to kubelet, as sketched below.
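For example, deploying the destination container might look as follows, assuming /etc/kubelet.d is the kubelet static-pod directory used for DOCA containers (adjust the path to your deployment):
# Deploy the destination container by copying the renamed YAML to kubelet
cp doca_snap_new.yaml /etc/kubelet.d/
# Verify that both source and destination SNAP containers are running
crictl ps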
After deploying the destination container, until the live update process is complete, avoid making any configuration changes via RPC. Specifically, do not create or destroy hotplug functions.
When restoring a controller in the destination container during a live update, it is recommended to use the same arguments originally used for controller creation in the source container.
Users may need to update the RPC alias, since the new container name has been changed.
Performing a live update causes the ML Optimizer Online service to be disabled in the source container if it is currently running. The service in the destination container remains unaffected and operates normally.
The live_update.py script is officially supported only for the following SPDK block devices: NVMe-oF/RDMA, Null, Malloc, and delay vbdev. While the script logic can be technically extended by users to support additional block devices, such configurations are not officially validated or supported by NVIDIA.
SNAP Container Live Upgrade Procedure
Follow the steps in section "Live Upgrade Prerequisites" and deploy the destination SNAP container using the modified .yaml file.
Query the source and destination containers:
crictl ps -r
Check for "SNAP started successfully" in the logs of the destination container, then copy the live update script from the container to your environment:
[dpu] crictl logs -f <dest-container-id>
[dpu] crictl exec <dest-container-id> cp /opt/nvidia/nvda_snap/bin/live_update.py /etc/nvda_snap/
Run the live_update.py script to move all active objects from the source container to the destination container:
[dpu] cd /etc/nvda_snap
[dpu] ./live_update.py -s <source-container-id> -d <dest-container-id>
Info: The live update tool also supports transitioning between the SNAP source package service and the SNAP container service (and vice versa). Use -s 0 to indicate that the source (original process) is running from the SNAP source package.
The live update tool does not support transitioning from one SNAP source package service to another SNAP source package service.
After the script completes, the live update process is done. Delete the source container by removing its YAML from the kubelet.
Note: To post RPCs, use the crictl tool:
crictl exec -it <container-id X> snap_rpc.py <RPC-method>
crictl exec -it <container-id Y> spdk_rpc.py <RPC-method>
Note: To automate the SNAP configuration (e.g., following failure or reboot) as explained in section "Automate SNAP Configuration (Optional)", spdk_rpc_init.conf and snap_rpc_init.conf must not include any configs as part of the live upgrade. Then, once the transition to the new container is done, spdk_rpc_init.conf and snap_rpc_init.conf can be modified with the desired configuration.
SNAP Container Live Upgrade Commands
The live upgrade process allows moving SNAP controllers and SPDK block devices between containers while minimizing host VM disruption.
The upgrade is done using a dedicated live update tool, which iterates over all active emulation functions and performs the following steps:
Suspend controller (admin only). On the source container, suspend the controller to admin-only mode. This ensures the controller no longer processes admin commands from the host driver, avoiding state changes during the handover. I/O traffic continues, so downtime has not started yet.
NVMe example:
snap_rpc.py nvme_controller_suspend --ctrl NVMeCtrl0VF0 --admin_only
Virtio-blk example:
snap_rpc.py virtio_blk_controller_suspend --ctrl [ctrl_name] --events_only
Preparation on destination container. On the destination container, create all required objects for the new controller, including attaching the backend device.
NVMe example:
spdk_rpc.py bdev_nvme_attach_controller ...
snap_rpc.py nvme_subsystem_create ...
snap_rpc.py nvme_namespace_create -n 1 ...
Virtio-blk example:
spdk_rpc.py bdev_nvme_attach_controller ...
Create suspended controller (as listener). On the destination container, create the controller in a suspended state and mark it as a listener for a live update notification from the source container. At this point, the controller in the source container is still handling I/O, so downtime has not started yet.
NVMe example:
snap_rpc.py nvme_controller_create --pf_id 0 --vf_id 0 --ctrl NVMeCtrl0VF0 --live_update_listener --suspended ...
snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl0VF0 -n 1
Virtio-blk example:
snap_rpc.py virtio_blk_controller_create --pf_id 0 --vf_id 0 --ctrl VBLKCtrl0VF0 ...
Suspend and notify. On the source container, suspend the controller using the --live_update_notifier flag. This triggers the start of downtime and sends a notification to the destination container. Once suspended, the controller on the destination container resumes and starts handling I/O. This marks the end of downtime.
NVMe example:
snap_rpc.py nvme_controller_suspend --ctrl NVMeCtrl0VF0 --live_update_notifier --timeout_ms
Virtio-blk example:
snap_rpc.py virtio_blk_controller_suspend --ctrl [ctrl_name] --live_update_notifier
Cleanup source container. After the migration is complete, clean up any remaining controller objects on the source container.
Note: The PF controller must remain present in the source container until all related virtual functions (VFs) have been removed.
NVMe example:
snap_rpc.py nvme_controller_detach_ns ...
spdk_rpc.py bdev_nvme_detach_controller ...
snap_rpc.py nvme_namespace_destroy ...
snap_rpc.py nvme_controller_destroy ...
Virtio-blk example:
snap_rpc.py virtio_blk_controller_destroy ...
spdk_rpc.py bdev_nvme_detach_controller ...
Shared Memory Pool Live Update
Shared Memory Live Update addresses resource constraints where insufficient hugepages are available to run two full SNAP processes simultaneously during a live update.
Standard live updates typically require double the memory allocation because the source and destination containers run concurrently during the transition.
Standard Requirement: 4GB Total (2GB per process: 1GB for SNAP + 1GB for SPDK).
This feature optimizes resource usage by allowing the two processes to "share" a portion of the memory pool, significantly reducing the peak hugepage requirement.
Optimized Requirement: 3GB Total (1GB Source SPDK + 1GB Destination SPDK + 1GB Shared SNAP memory).
While physical hugepage consumption is reduced to 3GB, the container orchestration layer (e.g., Kubernetes) may still require the limit to be set to the full 4GB to allow the transition. However, the additional 1GB does not need to be physically free on the host and will not be consumed during the live update process.
Memory Pool Architecture
To facilitate this sharing, SNAP divides its memory management into two distinct pools:
Base Mempool: The essential memory required for the process to operate.
Extended Mempool: Additional memory used for standard operation, which is managed dynamically during the update.
Live Update Configuration
To enable this feature, specific environment variables must be set in the doca_vfs.yaml configuration file.
Requirements:
SNAP_MEMPOOL_SIZE_MB must be greater than SNAP_MEMPOOL_BASE_SIZE_MB.
Configuration example:
env:
- name: SNAP_MEMPOOL_SIZE_MB
value: "1024"
- name: SNAP_MEMPOOL_BASE_SIZE_MB
value: "512"
Live Update Procedure
This feature introduces two specific RPCs to the live update workflow: memory_manager_deallocate_basepool and memory_manager_allocate_extpool.
Prepare Source: Issue the memory_manager_deallocate_basepool RPC on the source (old) container to release shared resources.
Deploy Destination: Start the destination (new) container. It will initialize using only the Base Mempool.
Execute Handover: Run the standard live update script and wait for completion.
Cleanup: Destroy the source container.
Finalize Destination: Issue the memory_manager_allocate_extpool RPC on the destination container to reclaim full memory capacity. A command-level sketch of the full sequence follows this list.
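The following is a minimal sketch of this sequence, assuming the two memory-manager RPCs are posted via snap_rpc.py through crictl (as shown earlier for other RPCs) and that /etc/kubelet.d is the kubelet static-pod directory; container IDs and paths are placeholders:
# 1. Release the shared (base) mempool on the source container
crictl exec -it <source-container-id> snap_rpc.py memory_manager_deallocate_basepool
# 2. Deploy the destination container (it starts with the Base Mempool only)
cp doca_snap_new.yaml /etc/kubelet.d/
# 3. Run the standard live update handover
./live_update.py -s <source-container-id> -d <dest-container-id>
# 4. Destroy the source container by removing its YAML from kubelet
rm /etc/kubelet.d/doca_snap.yaml
# 5. Reclaim full memory capacity on the destination container
crictl exec -it <dest-container-id> snap_rpc.py memory_manager_allocate_extpool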
If the source container’s SNAP instance is killed, restarted, or recovers for any reason during this process, the state is considered invalid. You must fully restart the service before attempting the live update steps again.
Message Signaled Interrupts eXtended (MSI-X) is an interrupt mechanism that allows devices to utilize multiple interrupt vectors, offering superior efficiency compared to traditional shared interrupts. In Linux environments, MSI-X reduces CPU utilization, improves device performance, and enhances scalability for high-performance hardware like network adapters and storage controllers.
Proper configuration of MSI-X interrupts is critical in multi-function environments such as SR-IOV. By default, BlueField distributes MSI-X vectors evenly among all Virtual Functions (VFs). The default distribution, however, is often suboptimal for heterogeneous environments where VFs are attached to different VMs with varying resource requirements. Dynamic MSI-X Management allows administrators to manually control the specific number of MSI-X vectors allocated to each VF independently.
The configuration steps and behaviors described in this section apply to all emulation types, specifically NVMe and virtio-blk.
Lifecycle and Persistence
Dynamic MSI-X management follows a strict lifecycle for resource allocation and reclamation.
Allocation Workflow
Reclaim: When no VF controllers are open (sriov_numvfs=0), PF-related MSI-X vectors are reclaimed from the VFs into the PF's global free pool.
Allocate: Users allocate MSI-X vectors from the free pool to a specific VF during controller creation.
Release: Users release vectors back to the pool when destroying a VF controller.
Persistence Rules
Once configured, the MSI-X allocation for a VF remains persistent.
State Change | Effect on MSI-X Configuration |
Application Restart/Crash | No Change |
Closing/Reopening PF | No Change (unless dynamic support is used) |
Explicit VF Release | Released (Returns to Pool) |
PF Explicit Reclaim | Reclaimed (Returns to Pool) |
Arm Cold Boot | Reset (Returns to Pool) |
Configuration Procedure (NVMe Example)
The following steps demonstrate Dynamic MSI-X configuration for an NVMe controller. The logic applies similarly to virtio-blk.
Step 1: Reclaim Resources
Ensure no VFs are active, then reclaim all MSI-X vectors to the PF's free pool.
snap_rpc.py nvme_controller_vfs_msix_reclaim <CtrlName>
Step 2: Query Resource Constraints
Query the controller to view the available resources in the PF's free pool.
snap_rpc.py nvme_controller_list -c <CtrlName>
Output Definitions:
free_msix: Total MSI-X vectors available in the PF pool (assigned via vf_num_msix).
free_queues: Total queues (doorbells) available in the PF pool (assigned via num_queues).
vf_min_msix/vf_max_msix: The valid configuration range for the vf_num_msix parameter.
vf_min_queues/vf_max_queues: The valid configuration range for the num_queues parameter.
Step 3: Create VF and Distribute Resources
Create the VF controller, specifying the exact resource allocation.
snap_rpc.py nvme_controller_create --vf_num_msix <n> --num_queues <m> ...
You must specify both vf_num_msix and num_queues. Omitting one can cause a mismatch between MSI-X allocation and queue configuration, potentially leading to driver malfunctions.
Allocations differ by protocol. Use the following logic to determine values:
NVMe: MSI-X vectors are allocated per Completion Queue (CQ).
Requirement: 1 MSI-X per IO Queue + 1 MSI-X for the Admin Queue.
Virtio: MSI-X vectors are allocated per Virtqueue.
Requirement: 1 MSI-X per Queue + 1 MSI-X for BAR configuration notifications.
Best practice formula:
num_queues = vf_num_msix - 1 (one MSI-X vector is reserved for the NVMe admin queue or for virtio BAR configuration notifications)
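For example, a VF that should expose 4 I/O queues would be created with 5 MSI-X vectors. The command below is an illustrative sketch following the creation syntax above; the IDs are placeholders:
snap_rpc.py nvme_controller_create --vf_num_msix 5 --num_queues 4 --pf_id 0 --vf_id 0 ...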
Step 4: VF Teardown
When destroying the VF, release resources back to the free pool.
snap_rpc.py nvme_controller_destroy --release_msix ...
Step 5: Enable SR-IOV
Enable the VFs on the host driver.
echo <N> > /sys/bus/pci/devices/<BDF>/sriov_numvfs
Safe Configuration Methods
To prevent system instability, strict ordering of operations is required.
Host Deadlock Risk: It is strongly recommended to open all VF controllers in SNAP before binding VFs to the host driver. If VFs are bound to the driver before configuration is complete, the driver may attempt to use resources that are not yet allocated. If resources are insufficient, this can lead to a host deadlock recoverable only by a cold boot.
To configure Dynamic MSI-X safely without risking deadlock, utilize one of the following methods:
Method A: Disable Autoprobe (Recommended)
Disable automatic driver binding, configure the VFs, and then manually bind them.
# 1. Disable autoprobe to prevent immediate binding
echo 0 > /sys/bus/pci/devices/<BDF>/sriov_drivers_autoprobe
# 2. Perform SNAP Configuration (Steps 1-5 above)
# 3. Manually bind VFs to the driver
echo "0000:01:00.0" > /sys/bus/pci/drivers/nvme/bind
Method B: Use VFIO Driver
Use the vfio-pci driver for SR-IOV configuration instead of the kernel driver.
# 1. Bind PF to VFIO driver
echo 0000:af:00.2 > /sys/bus/pci/drivers/vfio-pci/bind
# 2. Enable SR-IOV support
echo 1 > /sys/module/vfio_pci/parameters/enable_sriov
# 3. Create VFs
echo <N> > /sys/bus/pci/drivers/vfio-pci/0000:af:00.2/sriov_numvfs
The recovery feature enables the restoration of controller state after the SNAP application terminates—either gracefully or unexpectedly (e.g., due to kill -9).
Recovery is only possible if the SNAP application is restarted with the exact same configuration that was active prior to the shutdown or crash.
SNAP officially supports only the following SPDK block devices for recovery: NVMe-oF/RDMA, Null, Malloc, and delay vbdev. While the script logic can be technically extended by users to support additional block devices, such configurations are not officially validated or supported by NVIDIA.
NVMe Recovery
NVMe recovery enables the restoration of an NVMe controller after a SNAP application terminates, whether gracefully or due to a crash (e.g., kill -9).
To perform NVMe recovery:
Re-create the controller in a suspended state using the exact same configuration as before the crash (including the same bdevs, number of queues, namespaces, and namespace UUIDs).
Resume the controller only after all namespaces have been attached.
The recovery process uses shared memory files located under /dev/shm on the BlueField to restore the controller's internal state. These files are deleted when the BlueField is reset, meaning recovery is not supported after a BF reset.
Virtio-blk Crash Recovery
To use virtio-blk recovery, the controller must be re-created with the same configuration as before the crash (i.e. the same bdevs, num queues, etc).
The following options are available to enable virtio-blk crash recovery.
Virtio-blk Crash Recovery with --force_in_order
For virtio-blk crash recovery with --force_in_order, disable the VBLK_RECOVERY_SHM environment variable and create a controller with the --force_in_order argument.
In virtio-blk SNAP, the application is not guaranteed to recover correctly after a sudden crash (e.g., kill -9).
To enable the virtio-blk crash recovery, set the following:
snap_rpc.py virtio_blk_controller_create --force_in_order …
Setting --force_in_order may impact virtio-blk performance, as commands are served in order.
If --force_in_order is not used, any failure or unexpected teardown in SNAP or the driver may result in anomalous behavior because of limited support in the Linux kernel virtio-blk driver.
Virtio-blk Crash Recovery without --force_in_order
For virtio-blk crash recovery without --force_in_order, enable the VBLK_RECOVERY_SHM environment variable and create a controller without the --force_in_order argument.
Virtio-blk recovery allows the virtio-blk controller to be recovered after a SNAP application is closed whether gracefully or after a crash (e.g., kill -9).
To use virtio-blk recovery without the --force_in_order flag, VBLK_RECOVERY_SHM must be enabled and the controller must be re-created with the same configuration as before the crash (i.e., same bdevs, num queues, etc.).
When VBLK_RECOVERY_SHM is enabled, virtio-blk recovery uses files on the BlueField under /dev/shm to recover the internal state of the controller. Shared memory files are deleted when the BlueField is reset. For this reason, recovery is not supported after BlueField reset.
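A minimal recovery sketch, assuming VBLK_RECOVERY_SHM is already set in the SNAP environment and that the controller name, bdev, and queue count shown here are placeholders for the exact values used before the crash:
# Re-create the controller with exactly the same configuration that was active before the crash
snap_rpc.py virtio_blk_controller_create --pf_id 0 --bdev null0 --num_queues 4 --ctrl VblkCtrl0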
SNAP Configuration Recovery
SNAP can store its configuration as defined by user RPCs and, upon restart, reload it from a configuration JSON file. This mechanism is intended for recovering a previously configured SNAP state - it cannot be used for the initial configuration.
Usage:
Set the environment variable SNAP_RPC_INIT_CONF_JSON to the directory path where the configuration file will be stored.
The configuration file, snap_config.json, is created in this directory after all changes in your script have been successfully applied.
If a new configuration (different from the pre-shutdown configuration) is required after restarting SNAP, delete the existing snap_config.json file before applying the new settings.
When this method is used, there is no need to re-run snap RPCs or set RPCs in init files after the initial configuration — SNAP will automatically load the saved configuration from the SNAP_RPC_INIT_CONF_JSON path. This approach is recommended for fast recovery.
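For example, in a containerized deployment the variable could be set in the container YAML, reusing the env layout shown earlier (a sketch; the directory is an assumption and must be mounted into the container):
env:
  - name: SNAP_RPC_INIT_CONF_JSON
    value: "/etc/nvda_snap"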
SNAP Configuration Recovery does not support controller modifications. That is, using SNAP Configuration Recovery after a controller_modify RPC may cause unexpected behavior.
When modifying controller or function configurations, ensure the controller/function is not bound to any driver until the configuration process is complete. If the change is interrupted, recovery may fail.
Hotplugged emulation functions persist between SNAP runs (but not across BlueField resets) and should be set only once during initial configuration. Only controllers created on these functions are stored in the saved configuration state.
If crash recovery after a reboot is supported, store the file inside the container at /etc/nvda_snap. For unsupported use cases, store it in a temporary location such as /tmp/ or /dev/shm.
Improving SNAP Recovery Time
The following table outlines features designed to accelerate SNAP initialization and recovery processes following termination.
Feature | Description | How to? |
SPDK JSON-RPC configuration file | An initial configuration can be specified for the SPDK portion of SNAP. The configuration file is a JSON file containing all the SPDK JSON-RPC method invocations necessary for the desired configuration. Moving from posting RPCs to a JSON file improves bring-up time. Info: For more information, check the SPDK JSON-RPC documentation. | Generate a JSON-RPC file based on the current configuration and load it at SNAP startup (see the sketch after this table). Note: If SPDK encounters an error while processing the JSON configuration file, the initialization phase fails, causing SNAP to exit with an error code. |
Disable SPDK accel functionality | The SPDK accel functionality is necessary when using NVMe/TCP features. If NVMe/TCP is not used, accel should be manually disabled to reduce the SPDK startup time, which can otherwise take a few seconds. | Edit the SPDK JSON configuration file to disable all accel functionality. |
Provide the emulation manager name | If the emulation manager name is provided explicitly, SNAP does not need to discover it during startup, shortening initialization. | Use the relevant SNAP configuration option to provide the emulation manager name. |
SNAP configuration recovery | SNAP configuration recovery enables restoring the SNAP state without the need to re-post SNAP RPCs. By moving from posting individual RPCs to using a pre-saved JSON configuration file, the bring-up time is significantly improved. | Set SNAP_RPC_INIT_CONF_JSON as described in section "SNAP Configuration Recovery". |
Hugepages allocation | SNAP allocates a mempool from hugepages. Reducing its size can shorten SNAP's crash recovery time. | SNAP_MEMPOOL_SIZE_MB is set to 1024 MB by default. |
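A possible way to produce and consume such a JSON file is SPDK's standard save_config RPC together with SPDK's --json startup option; this is a sketch under those assumptions rather than the exact procedure referenced in the table:
# Dump the currently applied SPDK configuration to a JSON file
spdk_rpc.py save_config > /etc/nvda_snap/spdk_config.json
# Load it at the next SNAP startup through the SPDK application arguments
APP_ARGS="--json /etc/nvda_snap/spdk_config.json"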
The Watchdog and Heartbeat Monitoring feature is an experimental reliability mechanism designed to enhance the robustness of the system by automatically detecting and recovering from application hangs, crashes, or unresponsive components. This mechanism minimizes service disruption by triggering recovery procedures without requiring manual intervention.
The heartbeat system functions as a periodic signal emitted by the SNAP service to indicate its operational status. These signals serve as an indicator that the service is active and functioning as expected.
A dedicated watchdog component monitors the presence and frequency of heartbeat signals. If the heartbeat is not received within a predefined timeout interval, the watchdog determines that the monitored component is unresponsive and initiates a predefined recovery action.
The typical sequence of operations is as follows:
A SNAP component becomes unresponsive due to a crash, hang, or other failure condition.
The watchdog detects the absence of the expected heartbeat signal.
A recovery action is automatically triggered.
The SNAP service is restarted, and previously configured virtual disk states are restored.
Normal operation resumes.
The entire recovery process is designed to complete within a few seconds, thereby minimizing downtime.
Configuring the Watchdog
The behavior of the Watchdog and Heartbeat Monitoring system is configurable through environment variables. These variables allow the user to specify parameters such as heartbeat intervals, timeouts, and recovery policies without requiring changes to the application code.
Environment Variable | Impact on Recovery | Default Value |
| Interval (in milliseconds) between heartbeat signals from SNAP | |
| ID of the thread responsible for processing heartbeat signals | |
For more configuration options, check the snap_watchdog.py script.
Running the Watchdog
To initiate the watchdog service while the SNAP application is running, execute the following command:
./snap_watchdog.py --daemon
This command launches the watchdog in the background, where it continuously monitors the health of the SNAP service and initiates recovery procedures as necessary.
snap_watchdog.py requires Python 3.7 or above.
If SNAP is running in a BFB that only includes an older version of Python (e.g., Anolis 8), the user must also run the following command:
pip install dataclasses
The I/O Core Multiplexer (MP) is a configurable mechanism that determines how I/O requests from a single source are distributed across the available DPU cores. This feature is critical for optimizing performance based on application-specific needs, particularly in scenarios involving high I/O workloads.
The multiplexer offers two policy modes:
None (Default) – All I/O operations originating from a single source are processed by a single DPU core. I/O sources are distributed across DPU cores in a balanced manner.
Recommended for: Low-latency environments
Optimization focus: I/O latency
(Weighted) Round Robin – I/O requests from a single source are distributed across multiple DPU cores in a round-robin sequence. If the backend supports per-core weight configuration (e.g., SPDK NVMe-oF bdev), the distribution follows those weights. Otherwise, the I/Os are spread evenly.
Recommended for: Bandwidth-intensive environments or systems with low per-core backend throughput (e.g., TCP-based backends)
Optimization focus: I/O bandwidth
To configure the IO/Core Multiplexer policy, users need to set the IO_CORE_DISTRIBUTION_POLICY environment variable. The available options are:
none – Refers to the default policy, where all I/Os from a single source are handled by a single DPU core
weighted_rr – Refers to the (Weighted) Round Robin policy, distributing I/Os across multiple cores
Note: The weighted_rr policy is not supported for virtio-blk.
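For example, the policy could be selected via the container YAML using the same env list layout shown earlier in this document (a minimal sketch):
env:
  - name: IO_CORE_DISTRIBUTION_POLICY
    value: "weighted_rr"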
The DPA is an auxiliary processor designed to accelerate data-path operations. It comprises a cluster of 16 cores, each containing 16 execution units (EUs).
Total capacity available to SNAP: 171 EUs (index range: 0–170).
Supported protocols: SNAP utilizes DPA applications to accelerate NVMe and virtio-blk protocols.
Hardware constraint: There is a hardware limit of 128 queues (threads) per DPA EU.
By default, all EUs (0–170) are shared between NVMe, virtio-blk, and other system DPA applications (e.g., Virtio-net).
If other DPA applications (such as Virtio-net) are running concurrently with SNAP, you must configure a DPA resource YAML file to explicitly allocate EUs and avoid resource conflicts. For more details, see Single Point of Resource Distribution.
Method 1: YAML-Based Resource Management (Recommended)
The YAML-based tool is the primary method for controlling DPA EUs, offering a centralized and consistent way to allocate resources across applications.
Configuration Requirements
Partitioning: SNAP DPA applications must run on the ROOT partition. EUs configured in other partitions are unusable.
Valid Range: Only EUs 0–170 are available for SNAP.
Allocation:
At least 1 EU must be allocated per application instance.
EU allocations must not overlap across different applications.
EU groups are not supported.
Application Names: The YAML file must match SNAP's internal application names:
dpa_helper: Used for virtio-blk and NVMe (DPU mode). Note: For DPU mode, the number of instances should match the number of Arm cores used by SNAP and the number of EUs allocated.
dpa_virtq_split: Used for virtio-blk (DPA mode).
dpa_nvme: Used for NVMe (DPA mode).
Multi-Container Configuration
When running multiple SNAP containers, you must ensure unique application naming.
Set the environment variable SNAP_DPA_INSTANCE_ID_ENV to a unique ID inside each container.
Update the YAML to reflect the instance names using the format <APP_NAME>_<ID>.
Example (two virtio-blk DPU containers):
---
version: 25.04
---
DPA_APPS:
dpa_helper_1:
- partition: ROOT
affinity_EUs: [1-16]
dpa_helper_2:
- partition: ROOT
affinity_EUs: [17-32]
Deployment Workflow
Create Input YAML: Create your configuration file (e.g., ~/DPA_RESOURCE_INPUT.yaml).
Generate Config: Use the management tool to generate the final system configuration.
dpa-resource-mgmt config -d mlx5_0 -f ~/DPA_RESOURCE_INPUT.yaml
Set Environment Variable: Point SNAP to the generated configuration file.
export SNAP_DPA_YAML_PATH=~/ROOT.YAML
Note: If running SNAP in containers, ensure the generated YAML file path is mounted into the container (e.g., mapped to /etc/nvda_snap).
Warning: Do not manually edit the YAML file generated by the dpa-resource-mgmt tool.
Default Configuration Reference
The following represents the standard default configuration:
---
version: 25.04
---
DPA_APPS:
dpa_helper:
- partition: ROOT
affinity_EUs: [1-16]
dpa_virtq_split:
- partition: ROOT
affinity_EUs: [0-169]
dpa_nvme:
- partition: ROOT
affinity_EUs: [0-169]
Method 2: Core Mask (Specific Use Case)
The Core Mask method is an alternative configuration approach.
This method is the default and preconfigured method when running SNAP-4 concurrently with the SNAP virtio-fs service. For all other scenarios requiring DPA EU changes, use the YAML-based method described above.
Configuration
To assign a specific set of EUs via mask, set the corresponding environment variable using a hexadecimal mask.
Application | Environment Variable |
Virtio-blk / NVMe (DPU Mode) | dpa_helper_core_mask |
NVMe (DPA Mode) | dpa_nvme_core_mask |
Virtio-blk (DPA Mode) | dpa_virtq_split_core_mask |
Mask Logic
The core mask must contain valid hexadecimal digits and is parsed right to left.
Example: dpa_virtq_split_core_mask=0xff00
This sets 8 bits high (bits 8–15).
Result: 8 EUs are allocated (specifically EUs 8–15).
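For example, to allocate EUs 8-15 to the virtio-blk DPA application, the mask can be derived and exported as follows (the arithmetic shown is just one way of building the hexadecimal mask):
# ((1 << 8) - 1) << 8 sets bits 8-15, i.e., 0xff00
printf '0x%x\n' $(( ((1 << 8) - 1) << 8 ))
export dpa_virtq_split_core_mask=0xff00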
The SNAP ML Optimizer is a performance-tuning utility that dynamically adjusts polling parameters within the SNAP I/O subsystem. It is designed to improve controller throughput by identifying the optimal configuration based on current hardware, workload patterns, and system constraints.
How it works:
During runtime, the optimizer iteratively modifies internal configuration parameters (referred to as "actions").
After each configuration change, it measures the resulting system performance (referred to as the "reward").
Using predictive modeling, the optimizer determines the most promising configuration to evaluate next, allowing it to converge on an optimal setup efficiently.
This approach eliminates the need to exhaustively test all possible combinations, significantly reducing tuning time while ensuring performance gains.
Currently, the tool supports "IOPS" as the reward metric, which it aims to maximize.
SNAP ML Optimizer Online
The SNAP ML Online Optimizer continuously analyzes active workloads and applies optimization actions in the background. The system dynamically learns workload profiles and adapts quickly when known patterns reappear.
Once a workload has been characterized and a set of optimal parameters is applied, the optimizer enters an idle state until a significant change in the traffic pattern is detected.
Configuring SNAP ML Online Optimizer
The ML Optimizer can be enabled using one of two methods:
Environment Variable: Set SNAP_ML_OPTIMIZER_ENABLED=1 when launching SNAP.
Runtime RPC: Execute the snap_ml_optimizer_create RPC.
ML Optimizer RPCs
The following Remote Procedure Calls (RPCs) are used to manage the optimizer at runtime.
snap_ml_optimizer_create
Initializes or recreates the ML Optimizer, which will immediately begin analyzing and optimizing the system.
Enabling the ML Optimizer will override any system parameters previously applied via snap_actions_set, environment variables, or default settings.
snap_ml_optimizer_destroy
Stops the ML Optimizer service.
Once stopped, the system restores the default SNAP parameters.
All workload data collected by the optimizer is discarded. Any future instances of the optimizer will begin the learning process from scratch.
snap_ml_optimizer_is_running
Queries the current operational status of the ML Optimizer.
snap_ml_optimizer_current_parameters
Retrieves the currently active internal parameters. This command returns the active configuration regardless of whether it was applied by the ML Optimizer or manually configured by the user.
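Assuming these RPCs are invoked through snap_rpc.py like the other SNAP RPCs in this document, a typical runtime session might look as follows (a sketch, not captured output):
# Start the online optimizer; it immediately begins analyzing and optimizing
snap_rpc.py snap_ml_optimizer_create
# Query whether the optimizer is currently running
snap_rpc.py snap_ml_optimizer_is_running
# Show the currently active internal parameters
snap_rpc.py snap_ml_optimizer_current_parameters
# Stop the optimizer and restore default SNAP parameters
snap_rpc.py snap_ml_optimizer_destroy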
SNAP ML Optimizer Offline (DEPRECATED)
SNAP ML Optimizer Preparation Steps
Machine Requirements
The device should be able to SSH to the BlueField.
Python 3.10 or above
At least 6 GB of free storage
Setting Up SNAP ML Optimizer
To set up the SNAP ML optimizer:
Copy the snap_ml folder from the container to the shared nvda_snap folder and then to the requested machine:
crictl exec -it $(crictl ps -s running -q --name snap) cp -r /opt/nvidia/nvda_snap/bin/snap_ml /etc/nvda_snap/
Change directory to the snap_ml folder:
cd tools/snap_ml
Create a virtual environment for the SNAP ML optimizer.
python3 -m venv snap_ml
This ensures that the required dependencies are installed in an isolated environment.
Activate the virtual environment to start working within this isolated environment:
source snap_ml/bin/activate
Install the Python package requirements:
pip3 install --no-cache-dir -r requirements.txt
This may take some time depending on your system's performance.
Run the SNAP ML Optimizer.
python3 snap_ml.py --help
Use the --help flag to see the available options and usage information:
--version                     Show the version and exit.
-f, --framework <TEXT>        Name of framework (recommended: ax; supported: ax, pybo).
-t, --total-trials <INTEGER>  Number of optimization iterations. The recommended range is 25-60.
--filename <TEXT>             Where to save the results (default: last_opt.json).
--remote <TEXT>               Connect remotely to the BlueField card; format: <bf_name>:<username>:<password>
--snap-rpc-path <TEXT>        SNAP RPC prefix (default: container path).
--log-level <TEXT>            CRITICAL | ERROR | WARN | WARNING | INFO | DEBUG
--log-dir <TEXT>              Where to save the logs.
SNAP ML Optimizer Related RPCs
snap_actions_set
The snap_actions_set command is used to dynamically adjust SNAP parameters (known as "actions") that control polling behavior. This command is a core feature of SNAP-AI tools, enabling both automated optimization for specific environments and workloads, as well as manual adjustment of polling parameters.
Command parameters:
Parameter | Mandatory? | Type | Description |
poll_size | No | Number | Maximum number of I/Os SNAP passes in a single polling cycle (integer; 1-256) |
poll_ratio | No | Number | The rate at which SNAP poll cycles occur (float; 0 < poll_ratio <= 1) |
max_inflights | No | Number | Maximum number of in-flight I/Os per core (integer; 1-65535) |
max_iog_batch | No | Number | Maximum fairness batch size (integer; 1-4096) |
max_new_ios | No | Number | Maximum number of new I/Os to handle in a single poll cycle (integer; 1-4096) |
snap_actions_set cannot be used while the ML Optimizer Online is enabled.
snap_reward_get
The snap_reward_get command retrieves performance counters, specifically completion counters (or "reward"), which are used by the optimizer to monitor and enhance SNAP performance.
No parameters are required for this command.
Run the ML Optimizer
To optimize SNAP’s parameters for your environment, use the following command:
python3 snap_ml.py --framework ax --total-trials 40 --filename example.json --remote <bf_hostname>:<username>:<password> --log-dir <log_directory>
Results and Post-optimization Actions
Once the optimization process is complete, the tool automatically applies the optimized parameters. These parameters are also saved in an example.json file in the following format:
{
"poll_size": 30,
"poll_ratio": 0.6847347955107689,
"max_inflights": 32768,
"max_iog_batch": 512,
"max_new_ios": 32
}
Additionally, the tool documents all iterations, including the actions taken and the rewards received, in a timestamped file named example_<timestamp>.json.
Applying Optimized Parameters Manually
Users can apply the optimized parameters on fresh instances of SNAP service by explicitly calling the snap_actions_set RPC with the optimized parameters as follows:
snap_rpc.py snap_actions_set --poll_size 30 --poll_ratio 0.6847 --max_inflights 32768 --max_iog_batch 512 --max_new_ios 32
It is only recommended to use the optimized parameters if the system is expected to behave similarly to the system on which the SNAP ML optimizer is used.
Deactivating Python Environment
Once users are done using the SNAP ML Optimizer, they can deactivate the Python virtual environment by running:
deactivate
Plugins are modular components or add-ons that enhance the functionality of the SNAP application. They integrate seamlessly with the main software, allowing additional features without requiring changes to the core codebase. Plugins are designed for use only with the source package, as it allows customization during the build process, such as enabling or disabling plugins as needed.
In containerized environments, the SNAP application is shipped as a pre-built binary with a fixed configuration. Since the binary in the container is precompiled, adding or removing plugins is not possible. The containerized software only supports the plugins included during its build. For environments requiring plugin flexibility, such as adding custom plugins, the source package must be used.
To build a SNAP source package with a plugin, perform the following steps instead of the basic build steps:
Move to the sources folder. Run:
cd /opt/nvidia/nvda_snap/src/
Build the sources with plugin to be enabled. Run:
meson setup /tmp/build -Denable-bdev-null=true -Denable-bdev-malloc=true
Compile the sources. Run:
meson compile -C /tmp/build
Install the sources. Run:
meson install -C /tmp/build
Configure the SNAP environment variables and run SNAP service as explained in sections "Configure SNAP Environment Variables" and "Run SNAP Service".
Bdev
SNAP supports various types of block devices (bdev), offering flexibility and extensibility in interacting with storage backends. These bdev plugins provide different storage emulation options, allowing customization without requiring modifications to the core software.
SPDK
SPDK is the default plugin used by SNAP. If no specific plugin is explicitly specified, SNAP will default to using SPDK for block device operations.
For more information, refer to spdk_bdev.
Malloc
The Malloc plugin is intended for performance analysis and debugging purposes only; it is not suitable for production use
It creates a memory-backed block device by allocating a buffer in memory and exposing it as a block device
Since data is stored in memory, it is lost when the system shuts down
This plugin can be enabled using the enable-bdev-malloc build option.
Malloc configuration example:
Create Malloc bdev and use it with an NVMe controller:
# snap_rpc.py snap_bdev_malloc_create --bdev test 64 512
# snap_rpc.py nvme_subsystem_create -s nqn.2020-12.mlnx.snap
# snap_rpc.py nvme_namespace_create -s nqn.2020-12.mlnx.snap -t malloc -b test -n 1
# snap_rpc.py nvme_controller_create --pf_id=0 -s nqn.2020-12.mlnx.snap --mdts=7
# snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl1 -n 1
Delete Malloc bdev:
# snap_rpc.py snap_bdev_malloc_destroy test
Resize Malloc bdev:
# snap_rpc.py snap_bdev_malloc_resize test 32
This removes the existing bdev and creates a new one with the specified size. Data on the existing bdev will be lost.
NULL
The NULL plugin is designed for performance analysis and debugging purposes and is not intended for production use.
It acts as a dummy block device, accepting I/O requests and emulating a block device without performing actual I/O operations.
It is useful for testing or benchmarking scenarios that do not involve real storage devices.
The plugin consumes minimal system resources.
It can be enabled using the enable-bdev-null build option.
NULL configuration example:
Create a NULL bdev and use it with an NVMe controller:
# snap_rpc.py snap_bdev_null_create_dbg test 1 512
# snap_rpc.py nvme_subsystem_create -s nqn.2020-12.mlnx.snap
# snap_rpc.py nvme_namespace_create -s nqn.2020-12.mlnx.snap -t malloc -b test -n 1
# snap_rpc.py nvme_controller_create --pf_id=0 -s nqn.2020-12.mlnx.snap --mdts=7
# snap_rpc.py nvme_controller_attach_ns -c NVMeCtrl1 -n 1
Delete the NULL bdev:
# snap_rpc.py snap_bdev_null_destroy_dbg test