NVIDIA DOCA SNAP-3 User Guide
SNAP
NVIDIA® BlueField® SNAP and virtio-blk SNAP (Storage-defined Network Accelerated Processing) technology enables hardware-accelerated virtualization of local storage. NVMe/virtio-blk SNAP presents networked storage as a local block storage device, such as an SSD, emulating a local drive on the PCIe bus. The host OS/hypervisor uses its standard storage driver, unaware that it is communicating with the NVMe/virtio-blk SNAP framework rather than a physical drive. Any logic may be applied to the I/O requests or to the data via the NVMe/virtio-blk SNAP framework prior to redirecting the request and/or data over a fabric-based network to remote or local storage targets.
NVMe/virtio-blk SNAP is based on NVIDIA® BlueField-2 DPU family technology and combines unique hardware-accelerated storage virtualization with the advanced networking and programmability capabilities of the DPU. NVMe/virtio-blk SNAP together with the BlueField DPU enable a world of applications addressing storage and networking efficiency and performance.
The traffic from a host-emulated PCIe device is redirected to its matching storage controller opened on the mlnx_snap service. The controller, from its side, holds at least one open backend device (usually an SPDK block device). When a command is received, the controller executes it. Admin commands are answered immediately, while I/O commands are redirected to the backend device for processing. The request handling pipeline is completely asynchronous, and the workload is distributed across all Arm cores allocated to the SPDK application to achieve the best performance.
The following are key concepts for SNAP:
Full flexibility in fabric/transport/protocol (e.g. NVMe-oF/iSCSI/other, RDMA/TCP, ETH/IB)
NVMe and virtio-blk emulation support
Easy data manipulation
Using Arm cores for data path
BlueField SNAP/virtio-blk SNAP are licensed software. Users must purchase a license per BlueField device to use them.
Libsnap
Libsnap is a common library designed to assist in common tasks for applications wishing to interact with emulated hardware over BlueField DPUs and take the most advantage of the hardware capabilities. As such, libsnap exposes a simple API for the upper-layer application to create, modify, query, and destroy different emulation objects, such as PCIe BAR management, emulated queues, etc.
In addition, the library provides a set of helper functions to perform efficient DMA transactions between host and DPU memory.
The SNAP application makes extensive use of the libsnap library for resource management and the efficient DMA operations required by the storage controllers.
SNAP Installation Process
DPU Image Installation
The BlueField OS image (BFB) includes all packages needed for mlnx_snap to operate: MLNX_OFED, RDMA-CORE libraries, the supported SPDK version, as well as libsnap and mlnx-snap headers, libraries, and binaries.
To see which operating systems are supported, refer to the BlueField Software Documentation under Release Notes → Supported Platforms and Interoperability → Supported Linux Distributions.
RShim must be installed on the host to connect to the NVIDIA® BlueField® DPU. To install RShim, please follow the instructions described in the BlueField Software Documentation → BlueField DPU SW Manual → DPU Operation → DPU Bring-up and Driver Installation → Installing Linux on DPU → Step 1: Set up the RShim Interface.
Use the RShim interface from the x86 host machine to install the desired image:
BFB=/<path>/latest-bluefield-image.bfb
cat $BFB > /dev/rshim0/boot
Optionally, it is possible to connect to the remote console of the DPU and watch the progress of the installation process using the screen tool, for example:
screen /dev/rshim0/console
Post-installation Configuration
Firmware Configuration
Refer to Firmware Configuration to confirm the firmware configuration matches the SNAP application's requirements (SR-IOV support, MSI-X resources, etc).
Network Configuration
Before enabling or configuring mlnx_snap, users must first verify that uplink ports are configured correctly and that network connectivity toward the remote target works properly.
By default, two SF interfaces are opened—one over each PF as configured in /etc/mellanox/mlnx-sf.conf—which match RDMA devices mlx5_2 and mlx5_3 respectively. As mentioned, only these interfaces may support RoCE/RDMA transport for the remote storage.
If working with an InfiniBand link, an active InfiniBand port must be made available to allow for InfiniBand support. Once an active IB port is available, users must configure the port's RDMA device in the JSON configuration file (see rdma_device under "Configuration File Examples") for mlnx_snap to work on that port.
If working with bonding, it is transparent to the MLNX SNAP configuration, and no specific configuration is necessary at the NVMe/virtio-blk SNAP level.
Out-of-box Configuration
NVMe/virtio-blk SNAP is disabled by default. Once enabled (see section "Firmware Configuration"), the out-of-box configuration of NVMe/virtio-blk SNAP includes a single NVMe controller, backed by a 64MB RAM-based SPDK block device (i.e., a RAM drive) in non-offload mode. The out-of-box configuration does not include virtio-blk devices.
A sample configuration file for the out-of-box NVMe controller is located in /etc/mlnx_snap/mlnx_snap.json. For additional information about its values, please see section "Non-offload Mode".
The default initialization command set is described in /etc/mlnx_snap/spdk_rpc_init.conf and /etc/mlnx_snap/snap_rpc_init.conf, as follows:
spdk_rpc_init.conf
bdev_malloc_create 64 512
snap_rpc_init.conf
subsystem_nvme_create Mellanox_NVMe_SNAP "Mellanox NVMe SNAP Controller"
controller_nvme_create mlx5_0 --subsys_id 0 --pf_id 0
controller_nvme_namespace_attach -c NvmeEmu0pf0 spdk Malloc0 1
The BlueField out-of-box configuration is slightly different from that of BlueField-2. For a clean out-of-box experience, /etc/mlnx_snap/snap_rpc_init.conf is a symbolic link pointing to the relevant hardware-oriented configuration.
To make any other command set persistent, users may update and modify /etc/mlnx_snap/spdk_rpc_init.conf and /etc/mlnx_snap/snap_rpc_init.conf according to their needs. Refer to section "SNAP Commands" for more information.
SNAP Service Control (systemd)
To start, stop, or check the status of the SNAP service, run:
systemctl {start | stop | status} mlnx_snap
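The service may also be enabled to start automatically at boot using the standard systemd enable verb (shown here as an illustrative sketch):
systemctl enable mlnx_snap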
Logging
The mlnx_snap application output is captured by systemd and stored in its internal journal database. Users can retrieve the service output using the following systemd commands:
systemctl status mlnx_snap
journalctl -u mlnx_snap
systemd keeps logs in a binary format under the /var/run/log/journal/ directory, which is stored on tmpfs (i.e., it is not persistent).
systemd redirects log messages to the rsyslog service. The rsyslog configuration is the CentOS/RHEL default, so users may find all these messages in the /var/log/messages file.
The rsyslog daemon may be configured to send messages to a remote (centralized) syslog server if desired.
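For example, the journal may be followed live with the standard journalctl follow option, and rsyslog may forward messages to a central server using its standard forwarding syntax (a sketch; the server name below is a placeholder, and the last line is an rsyslog configuration entry, e.g., in /etc/rsyslog.conf):
journalctl -u mlnx_snap -f
*.* @@syslog.example.com:514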
RPC Command Interface
The Remote Procedure Call (RPC) protocol is a very simple protocol that defines a few data types and commands. Like other standard SPDK applications, NVMe/virtio-blk SNAP supports JSON-based RPC commands, allowing resources to be created, deleted, queried, and modified easily from the CLI.
The mlnx_snap application supports executing all standard SPDK RPC commands, in addition to an extended SNAP-specific command set. SPDK standard commands are executed by the standard spdk_rpc.py tool, while the SNAP-specific command set extension is executed by the equivalent snap_rpc.py tool.
Full spdk_rpc.py command set documentation can be found on the SPDK official documentation site.
The full snap_rpc.py extended command set is detailed later in this chapter.
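For example, an SPDK-level command and a SNAP-level command are issued with their respective tools (both commands appear elsewhere in this guide):
spdk_rpc.py bdev_malloc_create 64 512
snap_rpc.py emulation_managers_list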
PCIe Function Management Commands
Emulation Managers Discovery
Emulated PCIe functions are managed through IB devices called "emulation managers". The emulation managers are ordinary IB devices (e.g. mlx5_0, mlx5_1, etc.) with special privileges to also control PCIe communication and device emulations towards the host operating system. Numerous emulation managers may co-exist, each with its own set of capabilities.
The list of emulation managers with their capabilities can be queried using the following command:
snap_rpc.py emulation_managers_list
Appendix SPDK Configuration includes additional information.
Emulation Devices Configuration (Hotplug)
As mentioned above, each emulation manager holds a list of the emulated PCIe functions it controls. The PCIe functions may later be addressed by either their function index (in the emulation manager's list) or their PCIe BDF number (e.g., 88:00.2) as enumerated by the host OS. Some PCIe functions, configured at the firmware configuration stage, are considered "static" (i.e., always present).
In addition, users can dynamically add detachable functions to that list at runtime (and to the host's PCIe devices list accordingly). These functions are called "Hotplugged" PCIe functions.
After a new PCIe function is plugged, it is shown in the host's PCIe device list until it is either explicitly unplugged or the system goes through a cold reboot. A hot-plugged PCIe function remains persistent even after SNAP process termination.
Some OSs automatically start to communicate with the new function after it is plugged. Some continue to communicate with the function (for a certain time) even after it is signaled to be unplugged. Therefore, users must always keep an open controller (of a matching type) over any existing configured PCIe function (see NVMe Controller Management and Virtio-blk Controller Management for more details).
The following command hotplugs a new PCIe function to the system:
snap_rpc.py emulation_device_attach emu_manager {nvme,virtio_blk} [--id ID] [--vid VID] [--ssid SSID] [--ssvid SSVID] [--revid REVID] [--class_code CLASS_CODE] [--bdev_type {spdk,none}] [--bdev BDEV] [--num_queues NUM_QUEUES] [--queue_depth QUEUE_DEPTH][--total_vf TOTAL_VF] [--num_msix NUM_MSIX]
The following command hot-unplugs a PCIe function from the system:
snap_rpc.py emulation_device_detach <emu_manager> {nvme,virtio_blk} [-d PCI_BDF / -i PCI_INDEX / --vuid VUID]
Mandatory parameters:
emu_manager – emulation manager
{nvme,virtio_blk} – device type
Optional arguments:
--pci_bdf – PCIe BDF identifier
--pci_index – PCIe index identifier
--vuid – PCIe VUID identifier
--force – forcefully remove device (not recommended)
Note: At least one identifier must be provided to describe the PCIe function to be detached.
Once a PCIe function is unplugged from the host system (by calling emulation_device_detach), its controller is also deleted implicitly.
The following command lists all existing functions (either static or hotplugged):
snap_rpc.py emulation_functions_list
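For illustration, the following sketch hot-plugs a virtio-blk function backed by the out-of-box Malloc0 bdev and then lists the resulting functions (device identifiers and the bdev name follow the examples used elsewhere in this guide):
snap_rpc.py emulation_device_attach mlx5_0 virtio_blk --bdev_type spdk --bdev Malloc0
snap_rpc.py emulation_functions_list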
NVMe Emulation Management Commands
NVMe Subsystem
The NVMe subsystem, as described in the NVMe specification, is a logical entity which encapsulates sets of NVMe backends (or namespaces) and connections (or controllers). NVMe subsystems are extremely useful when working with multiple NVMe controllers, and especially when using NVMe virtual functions.
Each NVMe subsystem is defined by its serial number (SN), model number (MN), and qualified name (NQN). After creation, each subsystem also gets a unique index number.
The following example creates a new NVMe subsystem with a default generated NQN:
snap_rpc.py subsystem_nvme_create <serial_number> <model_number>
Mandatory parameters:
serial_number – subsystem serial number
model_number – subsystem model number
Optional arguments:
--nqn – subsystem qualified name (auto-generated if not provided)
--nn – maximal namespace ID allowed in the subsystem (default 0xFFFFFFFE; range 1-0xFFFFFFFE)
--mnan – maximal number of namespaces allowed in the subsystem (default 1024; range 1-0xFFFFFFFE)
The following command deletes an NVMe subsystem:
snap_rpc.py subsystem_nvme_delete <NQN>
Where <NQN> is the subsystem's qualified name.
The following command lists all NVMe subsystems:
snap_rpc.py subsystem_nvme_list
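A minimal worked example (values taken from the out-of-box configuration shown earlier): create a subsystem, list the existing subsystems, then delete the subsystem by the NQN reported by subsystem_nvme_list:
snap_rpc.py subsystem_nvme_create Mellanox_NVMe_SNAP "Mellanox NVMe SNAP Controller"
snap_rpc.py subsystem_nvme_list
snap_rpc.py subsystem_nvme_delete <NQN>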
NVMe Controller
Each NVMe device (e.g., NVMe PCIe entry) exposed to the host, whether a PF or a VF, must be backed by an NVMe controller, which is responsible for all protocol communication with the host's driver. Every new NVMe controller must also be linked to an NVMe subsystem. After creation, NVMe controllers can be addressed using either their name (e.g., "NvmeEmu0pf0") or both their subsystem NQN and controller ID.
The following command opens a new NVMe controller:
snap_rpc.py controller_nvme_create mlx5_0 [--pf_id ID / --pci_bdf / --vuid VUID] [--subsys_id ID / --nqn NQN]
Mandatory parameters:
emu_manager – emulation manager
Optional parameters:
--vf_id VF_ID – PCIe VF index to start emulation on (if the controller is destined to be opened on a VF). --pf_id must also be set for the command to take effect.
--conf – path to a JSON configuration file used to provide an extended set of configuration parameters. Full information concerning the different parameters of the configuration file can be found under appendix "JSON File Format".
--nr_io_queues – maximal number of I/O queues (default 32, range 0-32)
--mdts – maximum data transfer size (default 4, range 1-6)
--max_namespaces – maximum number of namespaces for this controller (default 1024, range 1-0xFFFFFFFE)
--quirks – bitmask to enable specific NVMe driver quirks to work with non-NVMe-spec-compliant drivers. For more information, refer to appendix "JSON File Format".
--mem {static,pool} – use memory from a global pool or from dedicated buffers. See "Mem-pool" for more information.
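For example, the out-of-box configuration shown earlier opens a controller on static PF 0; a controller may be opened on one of its VFs in a similar way (a sketch, assuming the subsystem with --subsys_id 0 already exists and SR-IOV is configured):
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 --pf_id 0
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 --pf_id 0 --vf_id 0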
The following command deletes an existing NVMe controller:
snap_rpc.py controller_nvme_delete [--name NAME / --subnqn SUBNQN --cntlid ID / --vuid VUID]
Optional arguments:
-c NAME, --name NAME – controller name. Must be set if --nqn and --cntlid are not set.
-n SUBNQN, --subnqn SUBNQN – NVMe subsystem (NQN). Must be set if --name is not set.
-i CNTLID, --cntlid CNTLID – controller identifier in NVMe subsystem. Must be set if --name is not set.
The following command lists all NVMe controllers:
snap_rpc.py controller_list --type nvme
Optional arguments:
-t {nvme,virtio_blk,virtio_net}, --type {nvme,virtio_blk,virtio_net} – controller type
NVMe Backend (Namespace)
NVMe namespaces are the representations of a continuous range of LBAs in a local/remote storage device (previously configured in the "Backend Configuration" section). Each namespace must be linked to a controller and have a unique identifier (NSID) across the entire NVMe subsystem (e.g., 2 namespaces cannot share the same NSID, even if linked to different controllers).
The SNAP application uses the SPDK block device framework as the backend for its NVMe namespaces. Therefore, the block devices should be configured in advance. For more information about SPDK block devices, see the SPDK BDEV documentation and appendix "SPDK Configuration".
The following command attaches a new namespace to an existing NVMe controller:
snap_rpc.py controller_nvme_namespace_attach [--ctrl CTRL / --subnqn SUBNQN --cntlid ID] <bdev_type> <bdev> <nsid>
Mandatory parameters:
bdev_type – block device type
bdev – block device to use as backend
nsid – namespace ID
Optional parameters:
-c CTRL, --ctrl CTRL – controller name. Must be set if --nqn and --cntlid are not set.
-n SUBNQN, --subnqn SUBNQN – NVMe subsystem (NQN). Must be set if --ctrl is not set.
-i CNTLID, --cntlid CNTLID – controller identifier in NVMe subsystem. Must be set if --ctrl is not set.
-q QN, --qn QN – QN of the remote target which provides this namespace
-p PROTOCOL, --protocol PROTOCOL – protocol used
-g NGUID, --nguid NGUID – namespace globally unique identifier
-e EUI64, --eui64 EUI64 – namespace EUI-64 identifier
-u UUID, --uuid UUID – namespace UUID
In full-offload mode, backends are acquired from the remote storage automatically and no manual configuration is required.
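In non-offload mode, for instance, the out-of-box Malloc0 bdev may be attached to the out-of-box controller as namespace 1 and then listed (a sketch reusing the names from the default configuration):
snap_rpc.py controller_nvme_namespace_attach -c NvmeEmu0pf0 spdk Malloc0 1
snap_rpc.py controller_nvme_namespace_list -c NvmeEmu0pf0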
The following command detaches a namespace from a controller:
snap_rpc.py controller_nvme_namespace_detach [--ctrl CTRL / --subnqn SUBNQN --cntlid ID] <nsid>
Mandatory parameters:
nsid – namespace ID
Optional parameters:
-c CTRL, --ctrl CTRL – controller name. Must be set if --nqn and --cntlid are not set.
-n SUBNQN, --subnqn SUBNQN – NVMe subsystem (NQN). Must be set if --ctrl is not set.
-i CNTLID, --cntlid CNTLID – controller identifier in NVMe subsystem. Must be set if --ctrl is not set.
The following command lists the namespaces of a controller:
snap_rpc.py controller_nvme_namespace_list [--ctrl CTRL / --subnqn SUBNQN --cntlid ID]
Optional parameters:
-c CTRL, --ctrl CTRL – controller name. Must be set if --nqn and --cntlid are not set.
-n SUBNQN, --subnqn SUBNQN – NVMe subsystem (NQN). Must be set if --ctrl is not set.
-i CNTLID, --cntlid CNTLID – controller identifier in NVMe subsystem. Must be set if --ctrl is not set.
Virtio-blk Emulation Management Commands
Virtio-blk Controller
Each virtio-blk device (e.g., virtio-blk PCIe entry) exposed to the host, whether a PF or a VF, must be backed by a virtio-blk controller. Virtio-blk is considered a limited storage protocol (compared to NVMe, for instance).
Due to protocol limitations:
Trying to use a virtio-blk device (e.g., probing the virtio-blk driver on the host) without an already functioning virtio-blk controller may cause the host server to hang until such a controller is opened successfully (no timeout mechanism exists)
Upon creation of a virtio-blk controller, a backend device must already exist
The following command creates a new virtio-blk controller:
snap_rpc.py controller_virtio_blk_create <emu_manager> [-d PCI_BDF / --pf_id PF_ID / --vuid VUID]
Mandatory parameters:
emu_manager – emulation manager
Optional parameters:
--vf_id – PCIe VF index to start emulation on, if the controller is destined to be opened on a VF
--num_queues – number of queues (default 64, range 2-256)
--queue_depth – queue depth (default 128, range 1-256)
--size_max – maximal SGE data transfer size (default 4096, range 1 – MAX_UINT16). See the virtio specification for more information.
--seg_max – maximal SGE list length (default 1, range 1-queue_depth). See the virtio specification for more information.
--bdev_type – block device type (spdk/none). Note that opening a controller with a none backend means opening it with a backend of size 0.
--bdev – SPDK block device to use as backend
--serial – serial number for the controller
--force_in_order – force in-order I/O completions. Note that this flag is required to ensure future virtio-blk controllers always successfully recover after an application crash.
--suspend – create the controller in a SUSPENDED state (it must be explicitly resumed later)
--mem {static,pool} – use memory from a global pool or from dedicated buffers. See "Advanced Features" for more information.
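For example (a sketch using the Malloc0 bdev from the out-of-box configuration), a controller may be opened on static PF 0, backed by an existing SPDK bdev:
snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Malloc0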
The following command deletes a virtio-blk controller:
snap_rpc.py controller_virtio_blk_delete [-c NAME / --vuid VUID]
Mandatory arguments:
name – controller name
Optional arguments:
--f, --force – force controller deletion
The following command lists all virtio-blk controllers:
snap_rpc.py controller_list --type virtio_blk
Optional arguments:
-t {nvme,virtio_blk,virtio_net}, --type {nvme,virtio_blk,virtio_net} – controller type
Virtio-blk controllers can also be suspended and resumed. While suspended, the controller stops receiving new requests from the host driver and only finishes handling requests already in flight (without reporting any IO errors back to the driver).
The following command suspends/resumes virtio-blk controller:
snap_rpc.py controller_virtio_blk_suspend <name>
snap_rpc.py controller_virtio_blk_resume <name>
Mandatory arguments:
name – controller name
Virtio-blk Backend Management
Like NVMe, virtio-blk also uses the SPDK block device framework for its backend devices. However, since virtio-blk is a limited storage protocol (as opposed to NVMe), its backend management abilities are limited as well:
Virtio-blk protocol supports only one backend device
Virtio-blk protocol does not support administration commands to add backends; thus, all backend attributes are communicated to the host virtio-blk driver over the PCIe BAR and must be accessible during driver probing. For that reason, backends can only be changed when the PCIe function is not in use by any host storage driver.
For these reasons, when the host driver is active, all backend management operations must occur only when the controller is in suspended state.
The following command attaches a new backend to a controller:
snap_rpc.py controller_virtio_blk_bdev_attach <ctrl_name> {spdk} <bdev_name>
Mandatory arguments:
name – controller name
{spdk} – block device type
bdev – block device to use as backend
Optional arguments:
--size_max – maximal SGE data transfer size (no hard limit). See the virtio specification for more information.
--seg_max – maximal SGE list length (no hard limit). See the virtio specification for more information.
The following command detaches a backend from a controller:
snap_rpc.py controller_virtio_blk_bdev_detach <ctrl_name>
Mandatory arguments:
ctrl_name – controller name
Destruction of SPDK block devices using the SPDK block device API is treated as a controller_virtio_blk_bdev_detach and is bound to the same limitations.
The following command lists the backend details of a controller:
snap_rpc.py controller_virtio_blk_bdev_list <name>
Mandatory arguments:
name – controller name
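Putting the above together, a backend may be swapped while the host driver is active by suspending the controller first (a sketch; VblkEmu0pf0 and the bdev names follow the naming used elsewhere in this guide and are illustrative):
snap_rpc.py controller_virtio_blk_suspend VblkEmu0pf0
snap_rpc.py controller_virtio_blk_bdev_detach VblkEmu0pf0
snap_rpc.py controller_virtio_blk_bdev_attach VblkEmu0pf0 spdk Malloc1
snap_rpc.py controller_virtio_blk_resume VblkEmu0pf0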
Debug and Statistics
BlueField SNAP and virtio-blk SNAP provide a set of commands which help customers retrieve performance and debug statistics about the opened emulated devices. The statistics are provided at the SNAP controller level (whether for NVMe or virtio-blk).
IO Statistics
The following commands are available to measure how many successful/failed IO operations were executed by the controller.
snap_rpc.py controller_nvme_get_iostat [-c CTRL_NAME]
snap_rpc.py controller_virtio_blk_get_iostat [-c CTRL_NAME]
These commands have minimal effect on BlueField SNAP performance and can therefore be used to sample statistics while the controller performs high-bandwidth IO operations.
Mandatory arguments:
CTRL_NAME – controller name
NVMe/Virtio IO statistics:
read_ios – number of read commands handled
completed_read_ios – number of read commands completed successfully
err_read_ios – number of read commands completed with error
write_ios – number of write commands handled
completed_write_ios – number of write commands completed successfully
err_write_ios – number of write commands completed with error
flush_ios – number of flush commands handled
completed_flush_ios – number of flush commands completed successfully
err_flush_ios – number of flush commands completed with error
Virtio IO specific statistics:
fatal_ios – number of commands dropped and never completed
outstanding_in_ios – number of outstanding IOs at a given moment
outstanding_in_bdev_ios – number of outstanding IOs at a given moment, pending backend handling
outstanding_to_host_ios – number of outstanding IOs at a given moment, pending DMA handling
Debug Statistics
The following commands are available to examine the controller and queues with more detailed status and information.
When queried frequently, these commands may impact performance and should therefore be called for debug purposes only.
snap_rpc.py controller_nvme_get_debugstat [-c NAME]
snap_rpc.py controller_virtio_blk_get_debugstat [-c NAME]
Initialization Scripts
The default initialization scripts /etc/mlnx_snap/spdk_rpc_init.conf and /etc/mlnx_snap/snap_rpc_init.conf allow users to control the startup configuration.
These scripts, which are used for the out-of-box configuration, may be modified by the user to control the SNAP initialization:
The spdk_rpc_init.conf file may be modified with the SPDK commands listed under the SPDK Configuration appendix.
The snap_rpc_init.conf file may be modified with the snap_rpc commands described throughout this chapter (SNAP Commands).
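For example, a persistent configuration that connects to a remote NVMe-oF target and exposes it through a virtio-blk controller might look as follows (a sketch; the target address, NQN, and bdev names are illustrative, and the commands are those documented in this guide):
spdk_rpc_init.conf:
bdev_nvme_attach_controller -b Nvme0 -t rdma -a 1.1.1.1 -f ipv4 -s 4420 -n Test
snap_rpc_init.conf:
controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Nvme0n1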
Performance Optimization
Tuning MLNX SNAP for the best performance may require additional resources from the system (CPU, memory) and may affect SNAP controller scalability.
Increasing Number of Used Arm Cores
By default, MLNX SNAP uses 4 Arm cores, with core mask 0xF0. The core mask is configurable in /etc/default/mlnx_snap (parameter CPU_MASK); for best performance, allocate all cores (i.e., CPU_MASK=0xFF).
As SNAP is an SPDK based application, it constantly polls the CPU and therefore occupies 100% of the CPU it runs on.
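For example, to allocate all 8 Arm cores, set CPU_MASK=0xFF in /etc/default/mlnx_snap (the parameter described above) and restart the service for the change to take effect:
systemctl restart mlnx_snap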
Disabling Mem-pool
Enabling mem-pool reduces the memory footprint but decreases overall performance.
To configure the controller not to use mem-pool, set MEM_POOL_SIZE=0 in /etc/default/mlnx_snap.
See section "Mem-pool" for more information.
Maximizing Single IO Transfer Data Payload
Increasing datapath staging buffer sizes improves performance for larger block sizes (>4K):
For NVMe, this can be controlled by increasing the MDTS value either in the JSON file or the RPC parameter. For more information regarding MDTS, refer to the NVMe specification. The default value is 4 (64K buffer), and the maximum value is 6 (256K buffer).
For virtio-blk, this can be controlled using the seg_max and size_max RPC parameters. For more information regarding these parameters, refer to the virtio-blk specification. No hard maximum limit exists.
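For instance, larger staging buffers may be requested at controller creation time (a sketch using the RPC parameters described earlier; the bdev name and the size_max/seg_max values are illustrative and stay within the documented ranges):
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 --pf_id 0 --mdts 6
snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Malloc0 --size_max 32768 --seg_max 8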
Increasing Emulation Manager MTU
The default MTU for the emulation manager network interface is 1500. Increasing the MTU above 4K on the emulation manager (e.g., MTU=4200) also enables the SNAP application to transfer a larger amount of data in a single host→DPU memory transaction, which may improve performance.
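For example, the MTU may be raised with standard Linux tooling (a sketch; the interface name is a placeholder for the emulation manager's network interface):
ip link set dev <interface> mtu 4200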
Optimizing Number of Queues and MSIX Vector (virtio-blk only)
SNAP emulated queues are spread evenly across all configured PFs (static and dynamic) and all defined VFs per PF (whether the functions are being used or not). This means that the larger the total number of functions SNAP is configured with (either PFs or VFs), the fewer queue and MSIX resources each function is assigned, which affects its performance accordingly. Therefore, it is recommended to configure in Firmware Configuration the minimal number of PFs and VFs per PF desired for that specific system.
Another consideration is matching the MSIX vector size to the desired number of queues. The standard virtio-blk kernel driver uses an MSIX vector to get events on both control and data paths. When possible, it assigns an exclusive MSIX for each virtqueue (e.g., per CPU core) and reserves an additional MSIX for configuration changes. If that is not possible, it uses a single MSIX for all virtqueues. Therefore, to ensure best performance with virtio-blk devices, the condition VIRTIO_BLK_EMULATION_NUM_MSIX > virtio_blk_controller.num_queues must be satisfied.
The total number of MSIXs is limited on BlueField-2 cards, so MSIX reservation considerations may apply when running with multiple devices. For more information, refer to this FAQ.
NVMe-RDMA Full Offload Mode
The NVMe-RDMA full offload mode allows reducing the Arm cores' CPU cost by offloading the data path directly to the firmware/hardware. This mode does not allow the user to control the data plane or the backend.
In full offload mode, the control plane is handled at the SW level, while the data plane is handled at the FW level and requires no SW interaction. For that reason, the user has no control over the backend devices; they are detected automatically, and no namespace management commands are required.
The NVMe-RDMA architecture:
In this mode, a remote target parameter must be provided using a JSON configuration file (a JSON file example can be found in /etc/mlnx_snap/mlnx_snap_offload.json.example), and the NVMe controller can detect and connect to the relevant backends by itself.
As the SNAP application does not participate in the datapath and needs fewer resources, it is recommended to reduce CPU_MASK to a single core (i.e., CPU_MASK=0x80). Refer to "Increasing Number of Used Arm Cores" for CPU_MASK configuration.
After configuration is done, users must create the NVMe subsystem and (offloaded) controller. Note that snap_rpc.py controller_nvme_namespace_attach is not required, and --rdma_device mlx5_2 is provided to mark the relevant RDMA interface for the connection.
The following example creates an NVMe full-offload controller:
# snap_rpc.py subsystem_nvme_create "Mellanox_NVMe_SNAP" "Mellanox NVMe SNAP Controller"
# snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 --pci_bdf 88:00.2 --nr_io_queues 32 --mdts 4 -c /etc/mlnx_snap/mlnx_snap.json --rdma_device mlx5_2
This is the matching JSON file example:
{
"ctrl": {
"offload": true,
},
"backends": [
{
"type": "nvmf_rdma",
"name": "testsubsystem",
"paths": [
{
"addr": "1.1.1.1",
"port": 4420,
"ka_timeout_ms": 15000,
"hostnqn": "r-nvmx03"
}
]
}
]
}
Full offload mode requires that the provided RDMA device (given in the --rdma_device parameter) supports RoCE transport (typically SF interfaces). Full offload mode for virtio-blk is not supported.
The discovered namespace ID may be remapped to get another ID when exposed to the host in order to comply with firmware limitations.
SR-IOV
SR-IOV configuration depends on the kernel version:
Optimal configuration may be achieved with a new kernel in which the sriov_drivers_autoprobe sysfs entry exists in /sys/bus/pci/devices/
Otherwise, the minimal requirement may be met if the sriov_totalvfs sysfs entry exists in /sys/bus/pci/devices/
SR-IOV configuration needs to be done on both the host and the DPU side, marked in the following example as [HOST] and [ARM] respectively. This example assumes that there is 1 VF on static virtio-blk PF 86:00.3 (the NVMe flow is similar) and that a Malloc0 SPDK BDEV exists.
Optimal Configuration
[ARM] snap_rpc.py controller_virtio_blk_create mlx5_0 -d 86:00.3 --bdev_type none
[HOST] modprobe -v virtio-pci && modprobe -v virtio-blk
[HOST] echo 0 > /sys/bus/pci/devices/0000:86:00.3/sriov_drivers_autoprobe
[HOST] echo 1 > /sys/bus/pci/devices/0000:86:00.3/sriov_numvfs
[ARM] snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --vf_id 0 --bdev_type spdk --bdev Malloc0
/* Continue by binding the VF PCIe function to the desired VM. */
After configuration is finished, no disk is expected to be exposed in the hypervisor. The disk only appears in the VM after the PCIe VF is assigned to it using the virtualization manager. If users want to use the device from the hypervisor, they simply need to bind the PCIe VF manually.
Minimal Requirement
[ARM] snap_rpc.py controller_virtio_blk_create mlx5_0 -d 86:00.3 --bdev_type none
[HOST] modprobe -v virtio-pci && modprobe -v virtio-blk
[HOST] echo 1 > /sys/bus/pci/devices/0000:86:00.3/sriov_numvfs
/* The host now hangs until configuration is performed on the DPU side */
[ARM] snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --vf_id 0 --bdev_type spdk --bdev Malloc0
/* Host is now released */
/* Continue by binding the VF PCIe function to the desired VM. */
Hotplug PFs do not support SR-IOV.
It is recommended to add pci=assign-busses to the boot command line when creating more than 127 VFs. Without this option, the following errors may appear from the host, and the virtio driver will not probe these devices:
pci 0000:84:00.0: [1af4:1041] type 7f class 0xffffff
pci 0000:84:00.0: unknown header type 7f, ignoring device
Zero Copy (SNAP-direct)
Zero-copy is supported on SPDK 21.07 and higher.
The SNAP-direct feature allows SNAP applications to transfer data directly from the host memory to remote storage without using any staging buffer inside the DPU.
SNAP enables the feature according to the SPDK BDEV configuration only when working against an SPDK NVMe-oF RDMA block device.
To configure the controller to use zero copy, set the following in /etc/default/mlnx_snap:
For virtio-blk:
VIRTIO_BLK_SNAP_ZCOPY=1
For NVMe:
NVME_SNAP_ZCOPY=1
NVMe/TCP Zero Copy
NVMe/TCP Zero Copy is implemented as a custom NVDA_TCP transport in SPDK NVMe initiator and it is based on a new XLIO socket layer implementation.
The implementation is different for Tx and Rx:
The NVMe/TCP Tx Zero Copy is similar between RDMA and TCP in that the data is sent from the host memory directly to the wire without an intermediate copy to Arm memory
The NVMe/TCP Rx Zero Copy allows achieving partial zero copy on the Rx flow by eliminating copy from socket buffers (XLIO) to application buffers (SNAP). But data still must be DMA'ed from Arm to host memory.
To enable NVMe/TCP zero copy, use SPDK v22.05.nvda built with --with-xlio.
.
For more information about XLIO including limitations and bug fixes, refer to the NVIDIA Accelerated IO (XLIO) Documentation.
To configure the controller to use NVMe/TCP zero copy, set the following in /etc/default/mlnx_snap:
EXTRA_ARGS="-u --mem-size 1200 --wait-for-rpc"
NVME_SNAP_TCP_RX_ZCOPY=1
SPDK_XLIO_PATH=/usr/lib/libxlio.so
MIN_HUGEMEM=4G
To connect using NVDA_TCP transport:
If /etc/mlnx_snap/spdk_rpc_init.conf is being used, add the following at the start of the file in the given order:
sock_set_default_impl -i xlio
framework_start_init
When the mlnx_snap service is started, run the following command:
[ARM] spdk_rpc.py bdev_nvme_attach_controller -b <NAME> -t NVDA_TCP -f ipv4 -a <IP> -s <PORT> -n <SUBNQN>
If /etc/mlnx_snap/spdk_rpc_init.conf is not being used, once the service is started, run the following commands in the given order:
[ARM] spdk_rpc.py sock_set_default_impl -i xlio
[ARM] spdk_rpc.py framework_start_init
[ARM] spdk_rpc.py bdev_nvme_attach_controller -b <NAME> -t NVDA_TCP -f ipv4 -a <IP> -s <PORT> -n <SUBNQN>
NVDA_TCP transport is fully interoperable with other implementations based on the NVMe/TCP specifications.
NVDA_TCP limitations:
SPDK multipath is not supported
NVMe/TCP data digest is not supported
SR-IOV is not supported
Robustness and Recovery
As SNAP is a standard user application running on the DPU OS, it is vulnerable to system interferences, like closing the SNAP application gracefully (i.e., stopping the mlnx_snap service), killing the SNAP process brutally (i.e., running kill -9), or even performing a full OS restart of the DPU. If there are exposed devices already in use by host drivers when any of these interferences occur, the host drivers/application may malfunction.
To avoid such scenarios, the SNAP application supports a "Robustness and Recovery" option. So, if the SNAP application gets interrupted for any reason, the next instance of the SNAP application will be able to resume where the previous instance left off.
This functionality can be enabled under the following conditions:
Only virtio-blk devices are used (this feature is currently not supported for NVMe protocol)
By default, the SNAP application is programmed to survive any kind of "graceful" termination, including controller deletion, service restart, and even (graceful) Arm reboot. If extended protection against brutal termination is required, such as sending SIGKILL to the SNAP process or performing a brutal Arm shutdown, the --force_in_order flag must be added to the snap_rpc.py controller_virtio_blk_create command.
Note: The force_in_order flag may impact performance when working with remote targets, as it may cause high rates of out-of-order completions, or when different queues are served at different rates.
It is the user's responsibility to open the recovered virtio-blk controller with the exact same characteristics as the interrupted virtio-blk controller (same remote storage device, same BAR parameters, etc.).
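For instance, a controller intended to survive a brutal termination would be created as follows (a sketch reusing the Malloc0 bdev from the out-of-box configuration):
snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Malloc0 --force_in_order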
Mem-pool
By default, the SNAP application pre-allocates all required memory buffers in advance.
A great amount of allocated memory may be required when using:
A large number of controllers (as with SR-IOV)
A large number of queues per controller
High queue depth
Large mdts (for NVMe) or seg_max and size_max (for virtio-blk)
To reduce the memory footprint of the application, users may choose to use mem-pool (a shared memory buffer pool) instead. However, using mem-pool may decrease overall performance.
To configure the controller to use mem-pool rather than private buffers:
In /etc/default/mlnx_snap, set the parameter MEM_POOL_SIZE to a non-zero value. This parameter accepts K/M/G notations (e.g., MEM_POOL_SIZE=100M). If K/M/G notation is not specified, the value defaults to bytes. Users must choose the right value for their needs—a value too small may cause longer starvations, while a value too large consumes more memory. As a rule of thumb, typical usage may set it to min(num_devices*4MB, 512MB).
Upon controller creation, add the option --mem pool. For example:
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 --pf_id 0 --mem pool
Note: The per-controller mem-pool configuration is independent from all others. Users can set some controllers to work with mem-pool and other controllers to work without it.
Virtio-blk Transitional Device (0.95)
SNAP supports virtio-blk transitional devices. Virtio transitional devices are devices that support both drivers conforming to the modern specification and legacy drivers (conforming to the legacy 0.95 specification).
To configure virtio-blk PCIe functions to be transitional devices, special firmware configuration parameters must be applied:
VIRTIO_BLK_EMULATION_PF_PCI_LAYOUT (0: MODERN / 1: TRANSITIONAL) – configures transitional device support for PFs. Note: This parameter is currently not supported.
VIRTIO_BLK_EMULATION_VF_PCI_LAYOUT (0: MODERN / 1: TRANSITIONAL) – configures transitional device support for underlying VFs. Note: This parameter is currently not supported.
VIRTIO_EMULATION_HOTPLUG_TRANS (True/False) – configures transitional device support for hot-plugged virtio-blk devices
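For example, transitional support for hot-plugged devices may be enabled with mlxconfig, following the firmware configuration commands shown in this guide (a sketch; a host power cycle is required for the change to take effect):
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0 s VIRTIO_EMULATION_HOTPLUG_TRANS=True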
To use virtio-blk transitional devices, Linux boot parameters must be set on the host:
If the kernel version is older than 5.1, set the following Linux boot parameter on the host OS: intel_iommu=off
If virtio_pci is built into the host OS kernel, set the following Linux boot parameter: virtio_pci.force_legacy=1
If virtio_pci is a kernel module rather than built-in, use force legacy when loading the module: modprobe -rv virtio_pci followed by modprobe -v virtio_pci force_legacy=1
For hot-plugged functions, additional configuration must be applied during SNAP hotplug operation:
# snap_rpc.py emulation_device_attach mlx5_0 virtio_blk --transitional_device --bdev_type spdk --bdev BDEV
Virtio-blk Live Migration
Live migration is a standard process supported by QEMU which allows system administrators to pass devices between virtual machines in a live running system. For more information, refer to QEMU VFIO device Migration documentation.
Live migration is supported for SNAP virtio-blk devices. It can be activated using a driver with proper support (e.g., NVIDIA's proprietary VDPA-based Live Migration Solution). For more info, refer to TBD.
If the physical function (PF) has been removed, for instance, when provisioning a virtio-blk PF with vDPA using the command:
python ./app/vfe-vdpa/vhostmgmt mgmtpf -a 0000:af:00.3
It is advisable to confirm and restore the presence of controllers in SNAP before attempting to re-add them using the command:
python dpdk-vhost-vfe/app/vfe-vdpa/vhostmgmt vf -v /tmp/sock-blk-0 -a 0000:59:04.5
The following procedure is designed for live deployment of small software bug fixes or modifications made in the SNAP application. Using this procedure for other purposes (e.g., bumping SNAP service to a new version on top of an older BFB image) may cause SNAP to malfunction.
To live upgrade SNAP, 2 SNAP processes must be opened in parallel.
All system resources (e.g., hugepages, memory) must be sufficient to temporarily support 2 SNAP application instances operating in parallel during the upgrade procedure.
Passing virtio-blk Controller's Management Between SNAP Processes
Open 2 SNAP processes simultaneously on the Arm.
Note: This requires changing the SPDK RPC server path.
Info: For lower downtime, it is highly recommended to run each process on a different CPU mask.
For SNAP Process 1, run:
./mlnx_snap_emu -m 0xf0 -r /var/tmp/spdk.sock1
For SNAP Process 2, run:
./mlnx_snap_emu -m 0x0f -r /var/tmp/spdk.sock2
Connect to the same bdev with both processes (i.e., with Malloc device).
For SNAP Process 1, run:
spdk_rpc.py -s /var/tmp/spdk.sock1 bdev_malloc_create -b Malloc1 1024 512
For SNAP Process 2, run:
spdk_rpc.py -s /var/tmp/spdk.sock2 bdev_malloc_create -b Malloc1 1024 512
Open a virtio-blk controller on the SNAP Process 1:
snap_rpc.py -s /var/tmp/spdk.sock1 controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Malloc1 --num_queues 16
Load virtio-blk driver on the host side and start using it.
Delete the virtio-blk controller instance from SNAP Process 1 and immediately open a virtio-blk controller on SNAP Process 2:
snap_rpc.py -s /var/tmp/spdk.sock1 controller_virtio_blk_delete VblkEmu0pf0 --force && snap_rpc.py -s /var/tmp/spdk.sock2 controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Malloc1 --num_queues 16
Full "Live Upgrade" Procedure
Assuming a fully configured SNAP service is already running on the system:
Create a local copy of the SNAP binary file (e.g., under the /tmp folder):
cp /usr/bin/mlnx_snap_emu /tmp/
For all active virtio-blk controllers, follow management passing procedure as described in section "Passing virtio-blk Controller's Management Between SNAP Processes".
Stop original SNAP service.
systemctl stop mlnx_snap
Upgrade SNAP service.
If installed from a binary package, use the Linux official installation framework (apt/yum)
If installed from sources, follow the same installation process as done originally
Repeat management passing procedure, this time to move back control from the local copy into the official (updated) version of SNAP service.
With a Linux environment on the host OS, additional kernel boot parameters may be required to support SNAP-related features:
To use SR-IOV, intel_iommu=on iommu=pt must be added
To use PCIe hotplug, pci=realloc must be added
When using SR-IOV, pci=assign-busses must be added
To view boot parameter values, use the command cat /proc/cmdline.
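For example, on distributions that use GRUB, these parameters are typically appended to GRUB_CMDLINE_LINUX in /etc/default/grub, after which the GRUB configuration is regenerated and the host is rebooted (a sketch; exact file paths and update commands vary by distribution):
GRUB_CMDLINE_LINUX="... intel_iommu=on iommu=pt pci=realloc pci=assign-busses"
grub2-mkconfig -o /boot/grub2/grub.cfg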
SPDK backend (BDEV) management commands:
spdk_rpc.py bdev_nvme_attach_controller -b <name> -t rdma -a <ip> -f ipv4 -s <port> -n <nqn>
spdk_rpc.py bdev_nvme_detach_controller <name>
spdk_rpc.py bdev_null_create <name> <size_mb> <blk_size>
spdk_rpc.py bdev_null_delete <name>
spdk_rpc.py bdev_aio_create <filepath> <name> <blk_size>
spdk_rpc.py bdev_aio_delete <name>
For more information, please refer to SPDK BDEV documentation.
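For example, the generic commands above may be used to create a 1GiB null bdev or attach a remote NVMe-oF subsystem (a sketch reusing the names and target values from the FAQ section of this guide):
spdk_rpc.py bdev_null_create Null0 1024 512
spdk_rpc.py bdev_nvme_attach_controller -b Nvme0 -t rdma -a 1.1.1.1 -f ipv4 -s 4420 -n Test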
Before configuring mlnx_snap, the user must ensure all FW configuration requirements are met. By default, mlnx_snap is disabled and needs to be enabled by running both the common mlnx-snap configuration and an additional protocol-specific configuration, depending on the expected usage of the application (e.g., hotplug, SR-IOV, UEFI boot, etc.).
After all configuration is finished, power-cycling the host is required for these changes to take effect.
To verify that all configuration requirements are satisfied, users may query the current/next configuration by running the following:
mlxconfig -d /dev/mst/mt41686_pciconf0 -e query
Basic Configuration
(Optional) Reset all previous configuration.
[dpu] mst start
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0 reset
Note: This will return your product to its default configuration. Do this only if you were not able to get SNAP to work.
Set general basic parameters.
On BlueField-2:
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0 s INTERNAL_CPU_MODEL=1
On BlueField:
[dpu] sudo mlxconfig -d /dev/mst/mt41682_pciconf0 s INTERNAL_CPU_MODEL=1 PF_BAR2_ENABLE=1 PF_BAR2_SIZE=1
When using RDMA/RoCE transport, additional parameters must be configured:
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0 s PER_PF_NUM_SF=1
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0 s PF_SF_BAR_SIZE=8 PF_TOTAL_SF=2
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0.1 s PF_SF_BAR_SIZE=8 PF_TOTAL_SF=2
System Configuration Parameters
Parameter | Description | Possible Values | Comments
| Enable SR-IOV | 0/1 |
| Number of VFs per emulated PF | [0-127] |
| Number of MSIX assigned to emulated PF | [0-63] |
| Number of MSIX assigned to emulated VF | [0-63] |
| Enable PCIe switch for emulated PFs | 0/1 |
| Max number of emulated PFs | [0-32] | Single port is reserved for all static PFs
SRIOV_EN is valid only for static PFs
NVMe Configuration
Parameter | Description | Possible Values | Comments
| Enable NVMe device emulation | 0/1 |
| Number of static emulated NVMe PFs | [0-2] |
| Number of MSIX assigned to emulated NVMe PF/VF | [0-63] |
| Number of VFs per emulated NVMe PF | [0-127] | If not 0, overrides
| Enable NVMe UEFI exprom driver | 0/1 | Used for UEFI boot process
Virtio-blk Configuration
Due to virtio-blk protocol limitations, using a bad configuration while working with static virtio-blk PFs may cause the host server OS to fail on boot.
Before continuing, make sure you have configured:
A working channel to access the Arm cores even when the host is shut down. Setting up such a channel is outside the scope of this document. Please refer to "NVIDIA BlueField BSP documentation" for more details.
Add the following line to /etc/mlnx_snap/snap_rpc_init.conf:
controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type none
For more information, please refer to section "Virtio-blk Controller Management".
Parameter | Description | Possible Values | Comments
| Enable virtio-blk device emulation | 0/1 |
| Number of static emulated virtio-blk PFs | [0-2] | See WARNING above
| Number of MSIX assigned to emulated virtio-blk PF/VF | [0-63] |
| Number of VFs per emulated virtio-blk PF | [0-127] | If not 0, overrides
| Enable virtio-blk UEFI exprom driver | 0/1 | Used for UEFI boot process
This section is relevant only for the following cases:
Legacy mode, in which the user prefers not to use the recommended SNAP commands but to use the JSON file format instead
NVMe-RDMA full offload mode, in which the configuration is only possible with the JSON file format
The configuration parameters are divided into two categories: controller and backends.
Configuration File Examples
Legacy Mode Configuration
For the non-full offload mode, it is recommended to use the SNAP RPC commands (described in SNAP Commands) and not the legacy mode of the JSON file format described in this section.
{
"ctrl": {
"func_num": 0,
"rdma_device": "mlx5_2",
"sqes": 0x6,
"cqes": 0x4,
"cq_period": 3,
"cq_max_count": 6,
"nr_io_queues": 32,
"mn": "Mellanox BlueField NVMe SNAP Controller",
"sn": "MNC12",
"mdts": 4,
"oncs": 0,
"offload": false,
"max_namespaces": 1024,
"quirks": 0x0,
"version": "1.3.0"
},
"backends": [
{
"type": "spdk_bdev",
"paths": [
{
}
]
}
]
}
NVMe-RDMA Full Offload Mode Configuration
For NVMe-RDMA full offload mode, users can only use the JSON file format (and not the SNAP RPC commands).
{
"ctrl": {
"func_num": 0,
"rdma_device": "mlx5_2",
"sqes": 0x6,
"cqes": 0x4,
"cq_period": 3,
"cq_max_count": 6,
"nr_io_queues": 32,
"mn": "Mellanox BlueField NVMe SNAP Controller",
"sn": "MNC12",
"mdts": 4,
"oncs": 0,
"offload": true,
"max_namespaces": 1024,
"quirks": 0x0
"version": "1.3.0"
},
"backends": [
{
"type": "nvmf_rdma",
"name": "testsubsystem",
"paths": [
{
"addr": "1.1.1.1",
"port": 4420,
"ka_timeout_ms": 15000,
"hostnqn": "r-nvmx03"
}
]
}
]
}
Configuration Parameters
Controller Parameters
Parameters in the SNAP JSON configuration file. The default file is located in /etc/mlnx_snap/mlnx_snap.json.
Parameter | Description | Legal Values | Default
| Describes the RPC server socket for passing through RPC commands. Relevant only when using vendor-specific RPC commands from host. | Any |
| Enable full-offload mode | |
| Maximum number of I/O queues. Note: The actual number of queues is limited by the number of queues supported by FW. | ≥ 0 |
| Model number | String (up to 40 chars) |
| Serial number | String (up to 20 chars) |
| Number of namespaces (NN) indicates the maximum value of a valid NSID for the NVM subsystem. | 0-0xFFFFFFFE |
| Maximum number of allowed namespaces (MNAN) supported by the NVM subsystem | 1-0xFFFFFFFE |
| Max data transfer size. This value is in units of the minimum memory page size (CAP.MPSMIN) and is reported as a power of two (2^n). A value of 0h indicates that there is no maximum data transfer size. | 1-6 |
| Bitmask for enabling specific NVMe driver quirks in order to work with non-NVMe spec compliant drivers. | |
| Limit number of available namespaces | Any |
Backend Parameters
These parameters are used to define the backend server.
Even though a list of backends can be configured, currently only a single backend is supported.
Parameter | Description | Legal Values | Default
| Backend type | |
| Depends on backend type | Any | Null
| Represents the desired size (in MB) of the opened backend. Note: Relevant only for | Any | Unused
| Represents the desired block size (in logarithmic scale) of the opened backend. Note: Relevant only for | 9, 12 | Unused
Path Section
This section is relevant only if the backend type is set to nvmf_rdma. For each backend, a list of paths can be specified using the following parameters:
Parameter | Description | Legal Values | Default
| Target IPv4 address | String in |
| Target port number | 1024-65534 |
| Keepalive timeout in msec | >0 |
| Host NQN | String up to 223-char long |
How do I enable SNAP?
Please refer to section "SNAP Installation".
How do I configure SNAP to support VirtIO-blk?
Please refer to section "Virtio-blk Configuration".
How do I configure SNAP to work with both ports (for the same or for multiple targets)?
Assumptions:
The remote target is configured with nqn "Test" and 1 namespace, and it exposes it through the 2 RDMA interfaces 1.1.1.1/24 and 2.2.2.1/24
The RDMA interfaces are 1.1.1.2/24 and 2.2.2.2/24
Non-offload mode configuration:
Create the SPDK BDEVS. Run:
spdk_rpc.py bdev_nvme_attach_controller -b Nvme0 -t rdma -a 1.1.1.1 -f ipv4 -s 4420 -n Test
spdk_rpc.py bdev_nvme_attach_controller -b Nvme1 -t rdma -a 2.2.2.1 -f ipv4 -s 4420 -n Test
Create NVMe controller. Run:
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 -c /etc/mlnx_snap/mlnx_snap.json --rdma_device mlx5_2
Attach the namespace twice, one through each port. Run:
snap_rpc.py controller_nvme_namespace_attach -c NvmeEmu0pf0 spdk Nvme0n1 1
snap_rpc.py controller_nvme_namespace_attach -c NvmeEmu0pf0 spdk Nvme1n1 2
At this stage, you should see /dev/nvme0n1 and /dev/nvme0n2 in the host's nvme list output, both of which are mapped to the same remote disk via 2 different ports.
Full-offload mode configuration:
Full-offload mode currently allows users to connect to multiple remote targets in parallel (but not to the same remote target through different paths).
Create 2 separate JSON full-offload configuration files (see section "NVMe-RDMA Full Offload Mode Configuration"). Each describes a connection to the remote target via a different RDMA interface.
Configure 2 separate NVMe device entries to be exposed to the host either as hot-plugged PCIe functions or “static” ones (see section "Firmware Configuration").
Create 2 NVMe controllers, one per RDMA interface. Run:
snap_rpc.py subsystem_nvme_create Mellanox_NVMe_SNAP "Mellanox NVMe SNAP Controller"
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 0 --pf_id 0 -c /etc/mlnx_snap/mlnx_snap_p0.json --rdma_device mlx5_2
snap_rpc.py subsystem_nvme_create Mellanox_NVMe_SNAP "Mellanox NVMe SNAP Controller"
snap_rpc.py controller_nvme_create mlx5_0 --subsys_id 1 --pf_id 1 -c /etc/mlnx_snap/mlnx_snap_p1.json --rdma_device mlx5_3
Note: NVMe controllers may also share the same NVMe subsystem. In this case, users must make sure all namespaces in all remote targets have a distinct NSID.
At this stage, you should see /dev/nvme0n1 and /dev/nvme1n1 in the host's nvme list output.
How do I configure offload mode? Which protocols are supported?
Please refer to section "NVMe-RDMA Full Offload Mode Configuration".
For more information on full offload, please refer to section "NVMe-RDMA Full Offload Mode".
How do I configure Firmware for SNAP?
Please refer to section "Firmware Configuration".
Can I work with custom SPDK on Arm?
MLNX SNAP is natively compiled against NVIDIA's internal branch of SPDK. It is possible to work with different SPDK versions, under the following conditions:
mlnx-snap sources must be recompiled against the new SPDK sources
The new SPDK version changes do not break any external SPDK APIs
Integration process:
Build SPDK (and DPDK) with shared libraries.
[spdk.git] ./configure --prefix=/opt/mellanox/spdk-custom --disable-tests --disable-unit-tests --without-crypto --without-fio --with-vhost --without-pmdk --without-rbd --with-rdma --with-shared --with-iscsi-initiator --without-vtune --without-isal
[spdk.git] make && sudo make install
[spdk.git] cp -r dpdk/build/lib/* /opt/mellanox/spdk-custom/lib/
[spdk.git] cp -r dpdk/build/include/* /opt/mellanox/spdk-custom/include/
Note: It is also possible to install DPDK in that directory, but copying suffices.
Note: Only the flag --with-shared is mandatory.
Build SNAP against the new SPDK.
[mlnx-snap.src] ./configure --with-snap --with-spdk=/opt/mellanox/spdk-custom --without-gtest --prefix=/usr
[mlnx-snap.src] make -j8 && sudo make install
Append additional custom libraries to the mlnx-snap application by setting LD_PRELOAD="/opt/mellanox/spdk/lib/libspdk_custom_library.so".
Note: Additional SPDK/DPDK libraries required by libspdk_custom_library.so might also need to be attached to LD_PRELOAD.
Note: The LD_PRELOAD setting can be added to /etc/default/mlnx_snap for persistent work with the mlnx_snap system service.
Run the application.
Can I replace my backend storage at runtime?
The NVMe protocol has embedded support for attaching/detaching backends (namespaces) at runtime.
To change backend storage during runtime for NVMe, run:
snap_rpc.py controller_nvme_namespace_detach -c NvmeEmu0pf0 1
snap_rpc.py controller_nvme_namespace_attach -c NvmeEmu0pf0 spdk nvme0n1 1
Virtio-blk does not have similar support in its protocol's specification. Therefore, detaching while running IO results in an error for any IO received between the detach and attach requests.
To change backend storage at runtime for virtio-blk, run:
snap_rpc.py controller_virtio_blk_bdev_detach VblkEmu0pf0
snap_rpc.py controller_virtio_blk_bdev_attach VblkEmu0pf0 spdk nvme0n1
I'm suffering from low performance after updating to latest mlnx-snap. How can I fix it?
After adding the option to work with a large number of controllers, resource constraints had to be taken into account. It was necessary to pay special attention to the MSIX resource, which is limited to ~1K across the whole BlueField-2 card. Therefore, new PCIe functions are now opened with limited resources by default (specifically, MSIX is set to 2).
Users may choose to assign more resources to a specific function, as detailed in the following:
Increase the number of MSIX allowed to be assigned to a function (power-cycle may be required for changes to take effect):
[dpu] mlxconfig -d /dev/mst/mt41686_pciconf0 s VIRTIO_BLK_EMULATION_NUM_MSIX=63
Hotplug virtio-blk PF with the increased value of MSIX.
[dpu] snap_rpc.py emulation_device_attach mlx5_0 virtio_blk --num_msix=63
Open the controller with increased number of queues (1 queue per MSIX, and leave another free MSIX for configuration interrupts):
[dpu] snap_rpc.py controller_virtio_blk_create mlx5_0 --pf_id 0 --bdev_type spdk --bdev Null0 --num_queues=62
For more information, please refer to section "Performance Optimization".