NVIDIA BlueField Platform Software Troubleshooting Guide


This page helps SNAP users and developers troubleshoot and resolve common issues when working with SNAP containers or source packages.

It is recommended to consult this page only after reviewing the latest BlueField SNAP for NVMe and Virtio-blk Documentation.

Verbosity Level

SNAP allows dynamically changing the log level of the logger backend using the snap_log_level_set RPC. Any log message at or below the requested level is shown.

Parameter: level (mandatory, number) – the log level to set:

  • 0 – Critical

  • 1 – Error

  • 2 – Warning

  • 3 – Info

  • 4 – Debug

  • 5 – Trace
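
For example, to raise verbosity to debug, the call looks something like the following (the positional argument form is an assumption; confirm the exact syntax with snap_rpc.py snap_log_level_set --help):

snap_rpc.py snap_log_level_set 4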

Examine the logs of the SNAP container:


crictl logs -f $(crictl ps -s running -q --name snap)


SNAP RPC Help

Use snap_rpc.py --help to review the supported RPCs and learn the parameters per command.
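
snap_rpc.py follows the SPDK rpc.py convention, so per-command help is assumed to be available as well, for example:

snap_rpc.py --help
snap_rpc.py virtio_blk_controller_list --help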

SNAP State Query

List Emulation Devices

Use the following command to list emulation devices:


snap_rpc.py emulation_function_list

For detailed information, refer to emulation_function_list.

List Virtio-Blk Controllers and Their Configurations

To list Virtio-Blk controllers and view the configurations and state for each one, use the following command:


snap_rpc.py virtio_blk_controller_list

For detailed information, refer to virtio_blk_controller_list.

List NVMe Subsystems, Controllers and Namespaces

To list NVMe subsystems, controllers, and namespaces under each subsystem, including their configurations and state, use the following command:


snap_rpc.py nvme_subsystem_list

For detailed information, refer to nvme_subsystem_list.

List NVMe Controllers Including Configurations and State

To list NVMe controllers along with their configurations and state, use the following command:


snap_rpc.py nvme_controller_list

For detailed information, refer to nvme_controller_list.

List NVMe Namespaces

To list NVMe namespaces, use the following command:


snap_rpc.py nvme_namespace_list

For detailed information, refer to nvme_namespace_list.

List SPDK BDevs

To list SPDK BDevs, use the following command:


spdk_rpc.py bdev_get_bdevs

For detailed information, refer to SPDK JSON-RPC Documentation.

RPC Log History

RPC log history (enabled by default) records all the RPC requests (from snap_rpc.py and spdk_rpc.py) sent to the SNAP application, along with the RPC response for each request, in a dedicated log file, /var/log/snap-log/rpc-log. This file is visible outside the container (i.e., the log file's path on the DPU is /var/log/snap-log/rpc-log as well).

The SNAP_RPC_LOG_ENABLE environment variable can be used to enable (1) or disable (0) this feature.

Info

RPC log history is supported with SPDK version spdk23.01.2-12 and above.

Warning

When RPC log history is enabled, the SNAP application constantly appends RPC request and response messages to /var/log/snap-log/rpc-log. Pay attention to the size of this file. If it grows too large, delete the file on the DPU before launching the SNAP pod.
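
For example, to follow the history live and keep the file from growing unbounded (a minimal sketch; deleting the file is one option among others):

tail -f /var/log/snap-log/rpc-log    # follow RPC requests and responses live
ls -lh /var/log/snap-log/rpc-log     # check the current file size
rm -f /var/log/snap-log/rpc-log      # delete on the DPU before relaunching the SNAP pod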

SNAP IO Level Statistics

Debug counters provide I/O statistics for each controller, offering insights into the distribution of I/O across different queues and the total I/O received by the controller.
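
As a sketch, the counters are queried per controller. The RPC name, the -c flag, and the controller name VblkEmu0pf0 below follow the naming pattern of the other controller RPCs in this guide but are assumptions; verify the exact command with snap_rpc.py --help:

snap_rpc.py virtio_blk_controller_list                               # find the controller name
snap_rpc.py virtio_blk_controller_dbg_io_stats_get -c VblkEmu0pf0    # per-queue and total I/O counters (assumed RPC name)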

Info

For details on known bugs and limitations, please refer to the SNAP Known Issues.

SPDK and SNAP Compatibility Issues

Each SNAP container release is bundled with the latest available NVIDIA SPDK. If you need to replace the SPDK version with a custom one, follow the instructions provided here.

If a build failure occurs due to compatibility issues between SNAP and SPDK, the /service/compat/spdk folder contains the infrastructure needed to address them.

SNAP Service Fails to Load Due to an Incorrect Firmware Configuration

If the firmware is not configured for NVMe or Virtio-blk, an error log message appears when attempting to load SNAP.

To enable the storage emulation based on your desired configuration, follow the instructions provided in Firmware Configuration.

Fewer Queues Created than Configured

By default, SNAP attempts to create the maximum number of queues within the MSIX limitation (up to 63).

If a higher number of queues is requested during controller creation, they are still subject to the MSIX limitation.

For configuring the number of MSIX entries, refer to: DPU Firmware Configuration

To dynamically manage MSIX, refer to: SR-IOV Dynamic MSIX Management
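
To check how many MSIX entries the firmware currently exposes, a query such as the following can help (the mst device path varies by system, and the exact variable names should be confirmed against DPU Firmware Configuration):

mst start
mlxconfig -d /dev/mst/mt41692_pciconf0 query | grep -i MSIX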

IO Failures During High Throughput with NVMeTCP Using XLIO

The default XLIO TCP configuration required for NVMeTCP is included in the SNAP container or source package. However, when scaling up tests, IO failures may occur specifically when using NVMeTCP. It is recommended to consult the Monitoring, Debugging, and Troubleshooting section of the NVIDIA Accelerated IO (XLIO) Documentation for guidance.

Deploying Container on Setups Without Internet

To deploy a container in environments without internet access, refer to Deploying Container on Setups Without Internet Connectivity.

Managing SNAP Service Memory Consumption

DPU memory is shared among all DPU services, and scaling the SNAP configuration may lead to memory shortages.

To understand SNAP memory configuration and usage, please refer to SNAP Memory Consumption.
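
As a quick check of the DPU's hugepage allocation and overall memory headroom (standard Linux interfaces, not SNAP-specific):

grep -i huge /proc/meminfo    # HugePages_Total, HugePages_Free, Hugepagesize
free -h                       # overall DPU memory usage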

Container Image Corruption

If the container image becomes corrupted and the container status shows as exited with the error message /usr/bin/supervisord: exec format error, follow these steps (a consolidated sketch follows the list):

  • Remove the YAML from kubelet.

  • Use crictl images to list the images and crictl rmi <image-id> to remove the image.

  • Restart the containerd and kubelet services with systemctl restart containerd and systemctl restart kubelet, respectively.

  • Reapply/copy the YAML file to kubelet.
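
A consolidated sketch of these steps, assuming the SNAP YAML is deployed from /etc/kubelet.d/ and named snap.yaml (both are assumptions; adjust to your deployment):

mv /etc/kubelet.d/snap.yaml /tmp/    # 1. remove the YAML so kubelet stops the pod
crictl images                        # 2. note the SNAP image ID
crictl rmi <image-id>                #    remove the corrupted image
systemctl restart containerd         # 3. restart the container runtime services
systemctl restart kubelet
mv /tmp/snap.yaml /etc/kubelet.d/    # 4. reapply the YAML so the container redeploys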

For additional information on container deployment and debugging, refer to SNAP Container Deployment.

Troubleshooting Issues When Enabling SR-IOV

When enabling SNAP virtual functions, ensure the host is configured according to the guidelines outlined in Host OS Configuration.

Additionally, follow the DPU firmware configuration instructions provided in SR-IOV Firmware Configuration.

Tuning the Host to Avoid Poor Performance Issues

To optimize host performance, follow the Intel configuration guidelines provided in Intel Host OS Configuration.

For AMD configurations, refer to AMD Host OS Configuration.

Recovering from a Service Crash

To recover from a service crash, review the recovery procedures and instructions for enabling recovery mode for NVMe and Virtio-Blk in Recovery.

Essential Debug Information for SNAP Service/Container Issues

To effectively debug issues encountered during the deployment and operation of the SNAP service/container, gather the following information (a collection sketch follows the lists):

From the Host Machine

  1. Host OS/kernel version, use: uname -a and cat /etc/os-release

  2. Host CPU model, use: lscpu

  3. Host commands:

    • Driver load/unload commands

    • Functions management (VF assignment, SR-IOV commands, FLR events)

    • Storage application commands (e.g., fio testing app command)

  4. Host dmesg output

From the BlueField Platform

  1. HW model, use: /hpc/local/bin/lshca

  2. BFB version, use: cat /etc/os-release

  3. SPDK version (if a non-default version is used)

  4. XLIO version (if a non-default version is used)

  5. FW version, use: mlxfwmanager

  6. FW configuration (refer to SR-IOV Firmware Configuration)

  7. Hugepage memory configuration (Refer to Allocate Hugepages under /etc/sysctl.conf)

  8. SNAP container configuration (Refer to YAML file)

  9. SNAP state, using the output of the RPCs listed under SNAP State Query (e.g., emulation_function_list, virtio_blk_controller_list, nvme_subsystem_list, nvme_controller_list, nvme_namespace_list)

  10. The SNAP and SPDK initialization RPCs (defined in /etc/nvda_snap/spdk_init.conf and /etc/nvda_snap/snap_init.conf)

  11. SNAP RPC logs (check the SNAP RPC log history, refer to RPC Log History)

  12. SNAP logs

    Info
    • To export container logs if the container crashes, use ls /var/log/containers/. Logs are prefixed with the container ID (e.g., 94cdeb21b031b), visible under crictl ps. If the container restarts, you may find multiple logs.

  13. Additional Logs:

    • Kubernetes logs, use: journalctl -u kubelet > <log_file_name>

    • DPU logs, check /var/log/messages and /var/log/dmesg
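
The items above can be gathered with a short script like the following sketch (output file names are arbitrary; run the first part on the host and the second on the DPU):

# On the host
uname -a > host_os.txt
cat /etc/os-release >> host_os.txt
lscpu > host_cpu.txt
dmesg > host_dmesg.txt

# On the BlueField DPU
/hpc/local/bin/lshca > dpu_hw.txt
cat /etc/os-release > dpu_bfb.txt
mlxfwmanager > dpu_fw.txt
journalctl -u kubelet > kubelet.log
cp /var/log/snap-log/rpc-log rpc-log.txt
cp /var/log/messages /var/log/dmesg .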

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.