This page is intended to help SNAP users and developers troubleshoot and resolve common issues when working with SNAP containers or source packages.
It is recommended to consult this page only after reviewing the latest BlueField SNAP for NVMe and Virtio-blk Documentation.
Verbosity Level
SNAP allows dynamically changing the log level of the logger backend using the snap_log_level_set RPC. Any log under the requested level is shown.
Parameter | Mandatory? | Type   | Description
          | Yes        | Number | Log level
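For example, to raise the verbosity at runtime (a minimal sketch; the level parameter name is not shown in the table above, so confirm the exact form with --help for your SNAP release):
snap_rpc.py snap_log_level_set --help   # shows the exact level argument for this release
snap_rpc.py snap_log_level_set 2        # illustrative only: request log level 2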
Examine the logs of the SNAP container:
crictl logs -f $(crictl ps -s running -q --name snap)
SNAP RPC Help
Use snap_rpc.py --help to review the supported RPCs and learn the parameters of each command.
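For example, to see the parameters of a specific command (nvme_controller_list is used here only as an illustration):
snap_rpc.py nvme_controller_list --help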
SNAP State Query
List Emulation Devices
Use the following command to list emulation devices:
snap_rpc.py emulation_function_list
For detailed information, refer to emulation_function_list.
List Virtio-Blk Controllers and Their Configurations
To list Virtio-Blk controllers and view the configurations and state for each one, use the following command:
snap_rpc.py virtio_blk_controller_list
For detailed information, refer to virtio_blk_controller_list.
List NVMe Subsystems, Controllers and Namespaces
To list NVMe subsystems, controllers, and namespaces under each subsystem, including their configurations and state, use the following command:
snap_rpc.py nvme_subsystem_list
For detailed information, refer to nvme_subsystem_list.
List NVMe Controllers Including Configurations and State
To list NVMe controllers along with their configurations and state, use the following command:
snap_rpc.py nvme_controller_list
For detailed information, refer to nvme_controller_list.
List NVMe Namespaces
To list NVMe namespaces, use the following command:
snap_rpc.py nvme_namespace_list
For detailed information, refer to nvme_namespace_list.
List SPDK BDevs
To list SPDK BDevs, use the following command:
spdk_rpc.py bdev_get_bdevs
For detailed information, refer to SPDK JSON-RPC Documentation.
RPC Log History
RPC log history (enabled by default) records all RPC requests (from snap_rpc.py and spdk_rpc.py) sent to the SNAP application, together with the RPC response for each request, in a dedicated log file, /var/log/snap-log/rpc-log. This file is visible outside the container (i.e., the log file's path on the DPU is /var/log/snap-log/rpc-log as well).
The SNAP_RPC_LOG_ENABLE environment variable can be used to enable (1) or disable (0) this feature.
RPC log history is supported with SPDK version spdk23.01.2-12 and above.
When RPC log history is enabled, the SNAP application constantly appends RPC request and response messages to /var/log/snap-log/rpc-log. Pay attention to the size of this file; if it grows too large, delete it on the DPU before launching the SNAP pod.
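For example, to inspect the history and keep its size under control on the DPU (standard shell commands, shown as a hint only):
tail -n 50 /var/log/snap-log/rpc-log   # view the most recent RPC requests and responses
ls -lh /var/log/snap-log/rpc-log       # check the file size
rm /var/log/snap-log/rpc-log           # delete it before launching the SNAP pod if it has grown too large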
SNAP IO Level Statistics
Debug counters provide I/O statistics for each controller, offering insights into the distribution of I/O across different queues and the total I/O received by the controller.
For Virtio-Blk I/O statistics, use:
snap_rpc.py virtio_blk_controller_dbg_io_stats_get
(Refer to virtio_blk_controller_dbg_io_stats_get)
For NVMe I/O statistics, use:
snap_rpc.py nvme_controller_dbg_io_stats_get
(Refer to nvme_controller_dbg_io_stats_get)
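A minimal sketch of querying per-controller statistics follows; the controller name (VblkCtrl1) is illustrative, and the exact flag for passing it should be confirmed with --help:
snap_rpc.py virtio_blk_controller_list                            # obtain the controller name
snap_rpc.py virtio_blk_controller_dbg_io_stats_get -c VblkCtrl1   # per-queue and total I/O counters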
For details on known bugs and limitations, please refer to the SNAP Known Issues.
SPDK and SNAP Compatibility Issues
Each SNAP container release is bundled with the latest available NVIDIA SPDK. If you need to replace the SPDK version with a custom one, follow the instructions provided here:
For the SNAP container: Building SNAP Container with Custom SPDK
For the SNAP source package: Replacing the BFB SPDK in SNAP Deployment
If a build failure occurs due to compatibility issues between SNAP and SPDK, the /service/compat/spdk folder contains the necessary infrastructure to address these compatibility issues.
SNAP Service Fails to Load Due to an Incorrect Firmware Configuration
If the firmware is not configured for NVMe or Virtio-blk, an error log message appears when attempting to load SNAP.
To enable the storage emulation based on your desired configuration, follow the instructions provided in Firmware Configuration.
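As an illustration only (not a substitute for the Firmware Configuration page), emulation is typically enabled with mlxconfig from the DPU; the MST device path below is an assumption, and a power cycle is required for the change to take effect:
mlxconfig -d /dev/mst/mt41692_pciconf0 s NVME_EMULATION_ENABLE=1 VIRTIO_BLK_EMULATION_ENABLE=1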
Fewer Queues Created than Configured
By default, SNAP attempts to create the maximum number of queues within the MSIX limitation (up to 63).
If a higher number of queues is requested during controller creation, it is still subject to the MSIX limitation.
For configuring the number of MSIX entries, refer to: DPU Firmware Configuration
To dynamically manage MSIX, refer to: SR-IOV Dynamic MSIX Management
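To check how many MSIX entries the firmware currently exposes, a query such as the following can be used (the MST device path is illustrative):
mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep -i msix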
IO Failures During High Throughput with NVMeTCP Using XLIO
The default XLIO TCP configuration required for NVMeTCP is included in the SNAP container or source package. However, when scaling up tests, IO failures may occur specifically when using NVMeTCP. It is recommended to consult the Monitoring, Debugging, and Troubleshooting section of the NVIDIA Accelerated IO (XLIO) Documentation for guidance.
Deploying Container on Setups Without Internet
To deploy a container in environments without internet access, refer to Deploying Container on Setups Without Internet Connectivity.
Managing SNAP Service Memory Consumption
DPU memory is shared among all DPU services, and scaling the SNAP configuration may lead to memory shortages.
To understand SNAP memory configuration and usage, please refer to SNAP Memory Consumption.
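Since SNAP memory is backed by hugepages on the DPU, a quick way to check the current allocation is (standard Linux command, shown as a hint only):
grep -i huge /proc/meminfo   # total, free, and reserved hugepages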
Container Image Corruption
If the container image becomes corrupted and the container status shows as exited with the error message /usr/bin/supervisord: exec format error, follow these steps:
1. Remove the YAML from kubelet.
2. Use crictl images to list the images and crictl rmi <image-id> to remove the image.
3. Restart the containerd and kubelet services with systemctl restart containerd and systemctl restart kubelet, respectively.
4. Reapply/copy the YAML file to kubelet.
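The steps above can be summarized in the following sketch; the YAML location (/etc/kubelet.d/) and file/image names are assumptions that should be adapted to your deployment:
mv /etc/kubelet.d/snap.yaml /tmp/    # remove the YAML from kubelet
crictl images                        # list images to find the corrupted one
crictl rmi <image-id>                # remove the corrupted image
systemctl restart containerd
systemctl restart kubelet
cp /tmp/snap.yaml /etc/kubelet.d/    # reapply the YAML to kubelet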
For additional information on container deployment and debugging, refer to SNAP Container Deployment.
Troubleshooting Issues When Enabling SR-IOV
When enabling SNAP virtual functions, ensure the host is configured according to the guidelines outlined in Host OS Configuration.
Additionally, follow the DPU firmware configuration instructions provided in SR-IOV Firmware Configuration.
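Once the host and firmware are configured, VFs are typically exposed through the standard Linux sysfs interface; the PCI address of the emulated PF below is purely illustrative:
echo 2 > /sys/bus/pci/devices/0000:25:00.2/sriov_numvfs   # create 2 virtual functions on the emulated PF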
Tuning the Host to Avoid Poor Performance Issues
To optimize host performance, follow the Intel configuration guidelines provided in Intel Host OS Configuration.
For AMD configurations, refer to AMD Host OS Configuration.
Recovering from a Service Crash
To recover from a service crash, review the recovery procedures and instructions for enabling recovery mode for NVMe and Virtio-Blk in Recovery.
Essential Debug Information for SNAP Service/Container Issues
To effectively debug issues encountered during the deployment and operation of the SNAP service/container, gather the following information:
From the Host Machine
- Host OS/kernel version, use: uname -a and cat /etc/os-release
- Host CPU model, use: lscpu
- Host commands:
  - Driver load/unload commands
  - Functions management (VF assignment, SR-IOV commands, FLR events)
  - Storage application commands (e.g., fio testing app command)
- Host dmesg output
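A minimal sketch for collecting the host-side items above into a single file (the output path is arbitrary):
{ uname -a; cat /etc/os-release; lscpu; dmesg; } > /tmp/snap_host_debug.txt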
From the BlueField Platform
- HW model, use: /hpc/local/bin/lshca
- BFB version, use: cat /etc/os-release
- SPDK version (if a non-default version is used)
- XLIO version (if a non-default version is used)
- FW version, use: mlxfwmanager
- FW configuration (refer to SR-IOV Firmware Configuration)
- Hugepage memory configuration (refer to Allocate Hugepages under /etc/sysctl.conf)
- SNAP container configuration (refer to the YAML file)
- SNAP state, using the output of the following RPCs:
  - snap_rpc.py emulation_function_list (refer to emulation_function_list)
  - snap_rpc.py virtio_blk_controller_list (in case of Virtio-Blk, refer to virtio_blk_controller_list)
  - snap_rpc.py nvme_subsystem_list (in case of NVMe, refer to nvme_subsystem_list)
  - spdk_rpc.py bdev_get_bdevs (see SPDK JSON-RPC Documentation)
- The SNAP and SPDK initialization RPCs (defined in etc/nvda_snap/spdk_init.conf and etc/nvda_snap/snap_init.conf)
- SNAP RPC logs (check the SNAP RPC log history; refer to RPC Log History)
- SNAP logs
Info: To export container logs if the container crashes, use ls /var/log/containers/. Logs are prefixed with the container ID (e.g., 94cdeb21b031b), visible under crictl ps. If the container restarts, you may find multiple logs.
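For example (the container ID below matches the example above and will differ on your setup):
crictl ps                                      # find the SNAP container ID
ls /var/log/containers/ | grep 94cdeb21b031b   # locate the matching log file(s)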
Additional Logs:
- Kubernetes logs, use: journalctl -u kubelet > <log_file_name>
- DPU logs, check: /var/log/messages and /var/log/dmesg