NVIDIA BlueField Platform Software Troubleshooting Guide


This page helps SNAP users and developers troubleshoot and resolve common issues when working with SNAP containers or source packages.

It is recommended to consult this page only after reviewing the latest BlueField SNAP for NVMe and Virtio-blk Documentation.

Verbosity Level

SNAP allows dynamically changing the log level of the logger backend using the snap_log_level_set RPC. Any log message at or below the requested level is shown.

Parameter: level (mandatory, number) – the log level to set:

  • 0 – Critical

  • 1 – Error

  • 2 – Warning

  • 3 – Info

  • 4 – Debug

  • 5 – Trace
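
For example, to raise verbosity to debug, the call looks something like the following (the positional argument form is an assumption; confirm the exact syntax with snap_rpc.py snap_log_level_set --help):

snap_rpc.py snap_log_level_set 4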

Examine the logs of the SNAP container:


crictl logs -f $(crictl ps -s running -q --name snap)


SNAP RPC Help

Use snap_rpc.py --help to review the supported RPCs and learn the parameters per command.
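
snap_rpc.py follows the SPDK rpc.py convention, so per-command help is assumed to be available as well, for example:

snap_rpc.py --help
snap_rpc.py virtio_blk_controller_list --help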

SNAP State Query

List Emulation Devices

Use the following command to list emulation devices:


snap_rpc.py emulation_function_list

For detailed information, refer to emulation_function_list.

List Virtio-Blk Controllers and Their Configurations

To list Virtio-Blk controllers and view the configurations and state for each one, use the following command:


snap_rpc.py virtio_blk_controller_list

For detailed information, refer to virtio_blk_controller_list.

List NVMe Subsystems, Controllers and Namespaces

To list NVMe subsystems, controllers, and namespaces under each subsystem, including their configurations and state, use the following command:


snap_rpc.py nvme_subsystem_list

For detailed information, refer to nvme_subsystem_list.

List NVMe Controllers Including Configurations and State

To list NVMe controllers along with their configurations and state, use the following command:


snap_rpc.py nvme_controller_list

For detailed information, refer to nvme_controller_list.

List NVMe Namespaces

To list NVMe namespaces, use the following command:


snap_rpc.py nvme_namespace_list

For detailed information, refer to nvme_namespace_list.

List SPDK BDevs

To list SPDK BDevs, use the following command:


spdk_rpc.py bdev_get_bdevs

For detailed information, refer to SPDK JSON-RPC Documentation.

RPC Log History

RPC log history (enabled by default) records all the RPC requests (from snap_rpc.py and spdk_rpc.py) sent to the SNAP application, along with the RPC response for each request, in a dedicated log file, /var/log/snap-log/rpc-log. This file is visible outside the container (i.e., the log file's path on the DPU is /var/log/snap-log/rpc-log as well).

The SNAP_RPC_LOG_ENABLE environment variable can be used to enable (1) or disable (0) this feature.

Info

RPC log history is supported with SPDK version spdk23.01.2-12 and above.

Warning

When RPC log history is enabled, the SNAP application constantly appends RPC request and response messages to /var/log/snap-log/rpc-log. Pay attention to the size of this file. If it grows too large, delete the file on the DPU before launching the SNAP pod.
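
For example, to follow the history live and keep the file from growing unbounded (a minimal sketch; deleting the file is one option among others):

tail -f /var/log/snap-log/rpc-log    # follow RPC requests and responses live
ls -lh /var/log/snap-log/rpc-log     # check the current file size
rm -f /var/log/snap-log/rpc-log      # delete on the DPU before relaunching the SNAP pod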

SNAP IO Level Statistics

Debug counters provide I/O statistics for each controller, offering insights into the distribution of I/O across different queues and the total I/O received by the controller.
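
As a sketch, the counters are queried per controller. The RPC name, the -c flag, and the controller name VblkEmu0pf0 below follow the naming pattern of the other controller RPCs in this guide but are assumptions; verify the exact command with snap_rpc.py --help:

snap_rpc.py virtio_blk_controller_list                               # find the controller name
snap_rpc.py virtio_blk_controller_dbg_io_stats_get -c VblkEmu0pf0    # per-queue and total I/O counters (assumed RPC name)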

Info

For details on known bugs and limitations, please refer to the SNAP Known Issues.

SPDK and SNAP Compatibility Issues

Each SNAP container release is bundled with the latest available NVIDIA SPDK. If you need to replace the SPDK version with a custom one, follow the instructions provided here.

If a build failure occurs due to compatibility issues between SNAP and SPDK, the /service/compat/spdk folder contains the infrastructure needed to address them.

SNAP Service Fails to Load Due to an Incorrect Firmware Configuration

If the firmware is not configured for NVMe or Virtio-blk, an error log message appears when attempting to load SNAP.

To enable the storage emulation based on your desired configuration, follow the instructions provided in Firmware Configuration.

Fewer Queues Created than Configured

By default, SNAP attempts to create the maximum number of queues within the MSIX limitation (up to 63).

If a higher number of queues is requested during controller creation, they are still subject to the MSIX limitation.

For configuring the number of MSIX entries, refer to: DPU Firmware Configuration

To dynamically manage MSIX, refer to: SR-IOV Dynamic MSIX Management
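
To check how many MSIX entries the firmware currently exposes, a query such as the following can help (the mst device path varies by system, and the exact variable names should be confirmed against DPU Firmware Configuration):

mst start
mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep -i MSIX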

IO Failures During High Throughput with NVMeTCP Using XLIO

The default XLIO TCP configuration required for NVMeTCP is included in the SNAP container or source package. However, when scaling up tests, IO failures may occur specifically when using NVMeTCP. It is recommended to consult the Monitoring, Debugging, and Troubleshooting section of the NVIDIA Accelerated IO (XLIO) Documentation for guidance.

Deploying Container on Setups Without Internet

To deploy a container in environments without internet access, refer to Deploying Container on Setups Without Internet Connectivity.

Managing SNAP Service Memory Consumption

DPU memory is shared among all DPU services, and scaling the SNAP configuration may lead to memory shortages.

To understand SNAP memory configuration and usage, please refer to SNAP Memory Consumption.
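
As a quick check of the DPU's hugepage allocation and overall memory headroom (standard Linux interfaces, not SNAP-specific):

grep -i huge /proc/meminfo    # HugePages_Total, HugePages_Free, Hugepagesize
free -h                       # overall DPU memory usage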

Container Image Corruption

If the container image becomes corrupted and the container status shows as exited with the error message /usr/bin/supervisord: exec format error, follow these steps (a consolidated sketch follows the list):

  • Remove the YAML from kubelet.

  • Use crictl images to list the images and crictl rmi <image-id> to remove the image.

  • Restart the containerd and kubelet services with systemctl restart containerd and systemctl restart kubelet, respectively.

  • Reapply/copy the YAML file to kubelet.
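
A consolidated sketch of these steps, assuming the SNAP YAML is deployed from /etc/kubelet.d/ and named snap.yaml (both are assumptions; adjust to your deployment):

mv /etc/kubelet.d/snap.yaml /tmp/    # 1. remove the YAML so kubelet stops the pod
crictl images                        # 2. note the SNAP image ID
crictl rmi <image-id>                #    remove the corrupted image
systemctl restart containerd         # 3. restart the container runtime services
systemctl restart kubelet
mv /tmp/snap.yaml /etc/kubelet.d/    # 4. reapply the YAML so the container redeploys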

For additional information on container deployment and debugging, refer to SNAP Container Deployment.

Troubleshooting Issues When Enabling SR-IOV

When enabling SNAP virtual functions, ensure the host is configured according to the guidelines outlined in Host OS Configuration.

Additionally, follow the DPU firmware configuration instructions provided in SR-IOV Firmware Configuration.

Tuning the Host to Avoid Poor Performance Issues

To optimize host performance, follow the Intel configuration guidelines provided in Intel Host OS Configuration.

For AMD configurations, refer to AMD Host OS Configuration.

Recovering from a Service Crash

To recover from a service crash, review the recovery procedures and instructions for enabling recovery mode for NVMe and Virtio-Blk in Recovery.

Essential Debug Information for SNAP Service/Container Issues

To effectively debug issues encountered during the deployment and operation of the SNAP service/container, gather the following information (a collection sketch follows the lists):

From the Host Machine

  1. Host OS/kernel version, use: uname -a and cat /etc/os-release

  2. Host CPU model, use: lscpu

  3. Host commands:

    • Driver load/unload commands

    • Functions management (VF assignment, SR-IOV commands, FLR events)

    • Storage application commands (e.g., fio testing app command)

  4. Host dmesg output

From the BlueField Platform

  1. HW model, use: /hpc/local/bin/lshca

  2. BFB version, use: cat /etc/os-release

  3. SPDK version (if a non-default version is used)

  4. XLIO version (if a non-default version is used)

  5. FW version, use: mlxfwmanager

  6. FW configuration (refer to SR-IOV Firmware Configuration)

  7. Hugepage memory configuration (Refer to Allocate Hugepages under /etc/sysctl.conf)

  8. SNAP container configuration (Refer to YAML file)

  9. SNAP state, using the output of the RPCs listed under SNAP State Query (e.g., emulation_function_list, virtio_blk_controller_list, nvme_subsystem_list, nvme_controller_list, nvme_namespace_list)

  10. The SNAP and SPDK initialization RPCs (defined in /etc/nvda_snap/spdk_init.conf and /etc/nvda_snap/snap_init.conf)

  11. SNAP RPC logs (check the SNAP RPC log history, refer to RPC Log History)

  12. SNAP logs

    Info
    • To export container logs if the container crashes, use ls /var/log/containers/. Logs are prefixed with the container ID (e.g., 94cdeb21b031b), visible under crictl ps. If the container restarts, you may find multiple logs.

  13. Additional Logs:

    • Kubernetes logs, use: journalctl -u kubelet > <log_file_name>

    • DPU logs, check /var/log/messages and /var/log/dmesg
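
The items above can be gathered with a short script like the following sketch (output file names are arbitrary; run the first part on the host and the second on the DPU):

# On the host
uname -a > host_os.txt
cat /etc/os-release >> host_os.txt
lscpu > host_cpu.txt
dmesg > host_dmesg.txt

# On the BlueField DPU
/hpc/local/bin/lshca > dpu_hw.txt
cat /etc/os-release > dpu_bfb.txt
mlxfwmanager > dpu_fw.txt
journalctl -u kubelet > kubelet.log
cp /var/log/snap-log/rpc-log rpc-log.txt
cp /var/log/messages /var/log/dmesg .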

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.