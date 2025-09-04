The SNAP container is configured through the doca_snap.yaml file. The command for downloading this file is available on the DOCA SNAP NGC page.

Note: Internet connectivity is necessary for downloading SNAP resources. To deploy the container on DPUs without Internet connectivity, refer to the appendix "Deploying Container on Setups Without Internet Connectivity".

The .yaml file can easily be edited for advanced configuration.

The SNAP .yaml file is configured by default to support Ubuntu setups (i.e., Hugepagesize = 2048 kB) by using hugepages-2Mi. To support other setups, edit the hugepages section according to the Hugepagesize value of the DPU OS. For example, to support CentOS 8.x, where Hugepagesize is 512 MB, configure:

    limits:
      hugepages-512Mi: "<number-of-hugepages>Gi"

Note: When deploying SNAP with a large number of controllers (500 or more), the default hugepage allocation (4GB) may become insufficient. This shortage can be identified through error messages, which typically indicate failures in creating or modifying QPs or other objects. In these cases, allocate more hugepages.
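As a rough sketch of the mapping described above, the following shell function reads text in the /proc/meminfo format and prints the resource name that would go in the hugepages section. The function name hugepages_key is hypothetical and not part of SNAP, and the sketch assumes the page size is at least 1024 kB.

```shell
# Hypothetical helper (not part of SNAP): derive the Kubernetes hugepages
# resource name from /proc/meminfo-style input read on stdin.
hugepages_key() {
    awk '/^Hugepagesize:/ {
        kb = $2
        if (kb % 1048576 == 0) printf "hugepages-%dGi\n", kb / 1048576
        else                   printf "hugepages-%dMi\n", kb / 1024
    }'
}
```

On the DPU it could be run as `hugepages_key < /proc/meminfo`; a Hugepagesize of 2048 kB yields hugepages-2Mi, and 524288 kB yields hugepages-512Mi.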

The following example edits the .yaml file to limit the SNAP container to 16 CPU cores, 4Gi of memory, and 4Gi of hugepages (while requesting 8 CPU cores and 2Gi of memory):

    resources:
      requests:
        memory: "2Gi"
        hugepages-2Mi: "4Gi"
        cpu: "8"
      limits:
        memory: "4Gi"
        hugepages-2Mi: "4Gi"
        cpu: "16"
    env:
      - name: APP_ARGS
        value: "-m 0xffff"

Note: If all BlueField-3 cores are requested, the user must verify that no other containers are in conflict over the CPU resources.

Note: When running the Virtio-fs service with a large number of cores, it is necessary to increase the number of IO buffers in SPDK. For example, to run with 16 cores, the size of the large IO buffer pool must be set to at least 4095. This can be configured by adding the RPC command iobuf_set_options --large-pool-count 4095 to spdk_rpc_init.conf under /etc/nvda_snap. Depending on the scale and the SPDK subsystems in use, other SPDK configuration parameters may need to be adjusted. Refer to the SPDK documentation for more details.
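The 0xffff value passed through APP_ARGS in the example above is a hexadecimal core mask with one bit per core. A minimal sketch for deriving such a mask, assuming the container should use cores 0 through N-1 (the helper name core_mask is hypothetical):

```shell
# Hypothetical helper: print a hex mask with the lowest N bits set,
# i.e. a mask selecting cores 0..N-1.
core_mask() {
    printf '0x%x\n' $(( (1 << $1) - 1 ))
}

core_mask 16   # prints 0xffff
```

A mask for a different set of cores (for example, one that skips core 0) would have to be written by hand.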

To automatically configure the SNAP container upon deployment, edit the files below according to the use case. During bring-up, SNAP forwards the content of these files to the appropriate RPC script, whether SPDK RPCs or SNAP RPCs. Ensure that the required RPCs for your use case are included.

Add the spdk_rpc_init.conf file under /etc/nvda_snap/. The file includes the required SPDK RPCs. File example:

    bdev_malloc_create 64 512

Add the snap_rpc_init.conf file under /etc/nvda_snap/. The file includes the required SNAP RPCs.

Virtio-blk file example:

    virtio_blk_controller_create --pf_id 0 --bdev Malloc0

NVMe file example:

    nvme_subsystem_create --nqn nqn.2022-10.io.nvda.nvme:0
    nvme_namespace_create -b Malloc0 -n 1 --nqn nqn.2022-10.io.nvda.nvme:0 --uuid 16dab065-ddc9-8a7a-108e-9a489254a839
    nvme_controller_create --nqn nqn.2022-10.io.nvda.nvme:0 --ctrl NVMeCtrl1 --pf_id 0 --suspended
    nvme_controller_attach_ns -c NVMeCtrl1 -n 1
    nvme_controller_resume -c NVMeCtrl1

Edit the .yaml file accordingly (uncomment):

    env:
      - name: SPDK_RPC_INIT_CONF
        value: "/etc/nvda_snap/spdk_rpc_init.conf"
      - name: SNAP_RPC_INIT_CONF
        value: "/etc/nvda_snap/snap_rpc_init.conf"

Note: It is the user's responsibility to make sure that the SNAP configuration matches the firmware configuration. That is, an emulated controller must be opened on all existing (static/hotplug) emulated PCIe functions (either through automatic or manual configuration). A PCIe function without a supporting controller is considered malfunctioning, and host behavior with it is anomalous.
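The two init files described above could be generated with a small script. The following is a sketch only, using the virtio-blk example; the function name write_snap_init_conf is hypothetical, and the target directory is a parameter so that the function can be tried outside the DPU before pointing it at /etc/nvda_snap.

```shell
# Sketch: create both init files for a single virtio-blk controller.
# write_snap_init_conf is a hypothetical name, not a SNAP tool.
write_snap_init_conf() {
    conf_dir=$1
    mkdir -p "$conf_dir" || return 1

    # SPDK RPCs: the malloc block device used as the backend.
    cat > "$conf_dir/spdk_rpc_init.conf" <<'EOF'
bdev_malloc_create 64 512
EOF

    # SNAP RPCs: open a virtio-blk controller on PF 0 backed by Malloc0.
    cat > "$conf_dir/snap_rpc_init.conf" <<'EOF'
virtio_blk_controller_create --pf_id 0 --bdev Malloc0
EOF
}
```

Calling `write_snap_init_conf /etc/nvda_snap` would overwrite any existing init files there, so it is worth backing them up first.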



Run the Kubernetes tool:

    [dpu] systemctl restart containerd
    [dpu] systemctl restart kubelet
    [dpu] systemctl enable kubelet
    [dpu] systemctl enable containerd

Copy the updated doca_snap.yaml file to the /etc/kubelet.d directory:

    cp doca_snap.yaml /etc/kubelet.d/

Kubelet automatically pulls the container image described in the YAML file from NGC and spawns a pod executing the container.

The SNAP service starts initialization immediately, which may take a few seconds. To verify SNAP is running:

Look for the message "SNAP Service running successfully" in the log

Send spdk_rpc.py spdk_get_version to confirm whether SNAP is operational or still initializing
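The first check above can be automated by polling a saved log for the start-up message. This is a minimal sketch: the function name wait_for_snap and the 30-second default timeout are assumptions, not SNAP features.

```shell
# Sketch: poll a log file until the SNAP start-up message appears.
# Arguments: path to the log file, and a timeout in seconds (default 30).
wait_for_snap() {
    log_file=$1
    remaining=${2:-30}
    while [ "$remaining" -gt 0 ]; do
        if grep -q "SNAP Service running successfully" "$log_file" 2>/dev/null; then
            return 0
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}
```

The function returns 0 once the message is found and 1 if the timeout expires first, so it can gate follow-up commands in a deployment script.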

View currently active pods, and their IDs (it might take up to 20 seconds for the pod to start):

    crictl pods

Example output:

    POD ID          CREATED              STATE   NAME
    0379ac2c4f34c   About a minute ago   Ready   snap

View currently active containers, and their IDs:

    crictl ps

View all existing containers, including ones that are not running, and their IDs:

    crictl ps -a

Examine the logs of a given container (SNAP logs):

    crictl logs <container_id>

Examine the kubelet logs if something does not work as expected:

    journalctl -u kubelet

The container log file is automatically saved by Kubelet to /var/log/containers/, using the filename format <pod_name>_default_snap-<container_id>.log.
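For scripting, the filename format above can be assembled from the pod name and container ID. A small sketch follows; the helper name snap_log_path is hypothetical, and both arguments must be supplied by the caller (for example, from the crictl output).

```shell
# Hypothetical helper: build the log path from a pod name and container ID,
# following the <pod_name>_default_snap-<container_id>.log format.
snap_log_path() {
    printf '/var/log/containers/%s_default_snap-%s.log\n' "$1" "$2"
}
```

When the exact container ID is not at hand, listing /var/log/containers/ is a simpler way to find the file.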

Refer to section "RPC Log History" for more logging information.

To persist a custom log level across container restarts, or to ensure it is applied during startup, add the relevant configuration command to the snap_rpc_init.conf file located at /etc/nvda_snap/.

Log level can also be modified at runtime using the snap_log_level_set RPC. For more details, refer to section "Log Management".

By default, the source package version of SNAP does not save logs automatically. To enable logging, follow the instructions in section "Run SNAP Service". For additional debugging information, refer to section "Build with Debug Prints Enabled".

To redirect SNAP output to a file, use the following command:

    /opt/nvidia/nvda_snap/bin/snap_service > snap.log 2>&1





SNAP is integrated with SOS, a framework for consistent and structured log collection.

To generate a log package:

Clone the SOS report tool: https://github.com/NVIDIA/doca-sosreport

Follow the installation instructions in the repository.

Run the following command:

    sos report --only snap_service,container_log

This creates a comprehensive log package. You may include additional plugins depending on the nature of the issue. For more details, refer to the "Collecting DOCA Logs for NVIDIA Inspection" page.

SNAP binaries are deployed within a Docker container as the SNAP service, which is managed as a supervisorctl service. Supervisorctl provides a layer of control and configuration for various deployment options.

In the event of a SNAP crash or restart, supervisorctl detects the action and waits for the exited process to release its resources. It then deploys a new SNAP process within the same container, which initiates a recovery flow to replace the terminated process.

In the event of a container crash or restart, kubelet detects the action and waits for the exited container to release its resources. It then deploys a new container with a new SNAP process, which initiates a recovery flow to replace the terminated process.

Note: After containers crash or exit, the kubelet restarts them with an exponential back-off delay (10s, 20s, 40s, etc.), which is capped at five minutes. Once a container has run for 10 minutes without an issue, the kubelet resets the restart back-off timer for that container. Restarting the SNAP service without restarting the container helps avoid back-off delays.
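To make the back-off schedule in the note above concrete, the following sketch computes the delay before the Nth consecutive restart, doubling from 10 seconds and capping at 300 seconds. It only illustrates the arithmetic and does not query kubelet; the function name backoff_delay is hypothetical.

```shell
# Sketch: print the delay (in seconds) before the Nth consecutive restart,
# doubling from 10 s and capping at 300 s (five minutes).
backoff_delay() {
    delay=10
    i=1
    while [ "$i" -lt "$1" ]; do
        delay=$((delay * 2))
        if [ "$delay" -ge 300 ]; then
            delay=300
            break
        fi
        i=$((i + 1))
    done
    echo "$delay"
}
```

For restarts 1 through 6 this gives 10, 20, 40, 80, 160, and 300 seconds, and every later restart stays at 300 seconds until the timer is reset.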

To kill the container, remove the .yaml file from /etc/kubelet.d/. To start the container, copy the .yaml file back to the same path:

    cp doca_snap.yaml /etc/kubelet.d/

To restart the container (with SIGTERM) using crictl, stop it with the -t (timeout) option:

    crictl stop -t 10 <container-id>

To restart the SNAP service without restarting the entire container, the user can either use the supervisorctl tool to restart the SNAP service or terminate the SNAP service process on the DPU. Different signals correspond to different termination behaviors. For example:

supervisorctl restart sends SIGTERM:

    crictl exec -it $(crictl ps -s running -q --name snap) supervisorctl restart snap

pkill -9 sends SIGKILL:

    pkill -9 -f snap

Note: SNAP service termination may take time as it releases all allocated resources. The duration depends on the scale of the use case and any other applications sharing resources with SNAP.





The termination time can be shortened by configuring supervisorctl to give the exited SNAP process a shorter (or zero) termination time when using supervisorctl restart snap.

This causes the new process to start up while the old process' resources are still being freed by the kernel.

The user must ensure that the hugepage allocation is sufficient to accommodate both processes running in parallel. To modify the time SNAP is given to exit, set the SUPERVISOR_STOPWAITSECS environment variable.

