DPL Container Deployment
The DPL Runtime Service supports Bluefield devices (DPU mode) and ConnectX-9 (Host mode).
There are a few differences when preparing the system to run the DPL Runtime Service in DPU mode (BlueField) versus Host mode (ConnectX-9); these differences are outlined in the sections below where applicable.
This section is specific to DPU mode when running on BlueField devices.
Setting BlueField to DPU Mode
BlueField must run in DPU mode to use the DPL Runtime Service. For details on how to change modes, see: BlueField Modes of Operation.
Determining Your BlueField Variant
Your BlueField may be installed in a host server or it may be a standalone server.
If your BlueField is a standalone server, ignore the parts that mention the host server or SR-IOV. You may still use Scalable Functions (SFs) if your BlueField is a standalone server.
Setting Up DPU Management Access and Updating BlueField-Bundle
These pages provide detailed information about DPU management access, software installation, and updates:
Systems with a host server typically use RShim (i.e., the tmfifo_net0 interface). Standalone systems must use the OOB interface option for management access.
Changing the eSwitch to switchdev mode
Do this before creating SR-IOV Virtual Functions. If Virtual Functions already exist for the interface, remove them before changing the mode.
The DPL Runtime Service can only start if the eSwitch is in switchdev mode. If it's not, an error will be logged on startup and the process will exit.
If the platform is a BlueField in DPU mode, run this command in the DPU shell; otherwise (e.g., on ConnectX-9), use the host shell.
Your BlueField DPU may already be configured in switchdev mode after the BFB installation, in which case this step is unnecessary.
Find the PCI address of the interface you'd like to use with the DPL Runtime Service and run the following command (replace the pci/<addr> part with the correct value):
Example
sudo devlink dev eswitch set pci/0000:03:00.0 mode switchdev
Here are a few options for commands that may help you find your PCI address:
lspci -D
mst status -v
ip -d link
ethtool -i <interface name>
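Alternatively, an interface name can be resolved to its PCI address directly from sysfs. A minimal sketch (the function name and the SYSFS_NET override are illustrative, not part of the service):

```shell
# Print the PCI address of a network interface by resolving its sysfs
# "device" symlink. SYSFS_NET defaults to /sys/class/net and can be
# overridden, e.g. for testing.
pci_of_iface() {
    local iface="$1"
    local sysfs="${SYSFS_NET:-/sys/class/net}"
    basename "$(readlink -f "$sysfs/$iface/device")"
}

# Usage: pci_of_iface eth2   (prints something like 0000:03:00.0)
```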
devlink settings are not persistent across reboots.
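Because devlink settings do not survive a reboot, you may want to reapply the mode automatically at boot. One possible approach is a oneshot systemd unit; the unit below is a sketch (the unit name, PCI address, and devlink binary path are assumptions for your setup):

```
[Unit]
Description=Set eSwitch to switchdev mode for the DPL Runtime Service
After=network-pre.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/devlink dev eswitch set pci/0000:03:00.0 mode switchdev

[Install]
WantedBy=multi-user.target
```

Saved, for example, as /etc/systemd/system/dpl-switchdev.service, it would be enabled with sudo systemctl daemon-reload && sudo systemctl enable dpl-switchdev.service.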
Enabling Multiport eSwitch Mode (Optional)
This step is optional and depends on your DPL program and setup needs.
Multiport eSwitch mode allows for traffic forwarding between multiple physical ports and their VFs/SFs (e.g., between p0 and p1).
Before enabling this mode:
Ensure LAG_RESOURCE_ALLOCATION is enabled in firmware:
Example
sudo mlxconfig -d 0000:03:00.0 s LAG_RESOURCE_ALLOCATION=1
Refer to the Using mlxconfig guide for more information.
After a reboot or firmware reset, enable esw_multiport mode:
Example
sudo devlink dev param set pci/0000:03:00.0 name esw_multiport value 1 cmode runtime
devlink settings are not persistent across reboots.
Creating SR-IOV Virtual Functions
To use SR-IOV, first create Virtual Functions (VFs) on the host server:
Example
sudo -s # enter sudo shell
echo 4 > /sys/class/net/eth2/device/sriov_numvfs
exit # exit sudo shell
Entering a sudo shell is necessary because sudo would otherwise apply only to the echo command and not to the redirection (>), which would result in "Permission denied."
This example creates 4 VFs under Physical Function eth2. Adjust the number as needed.
If a PF already has VFs and you'd like to change the number, first set it to 0 before applying the new value.
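The reset-then-set sequence can be wrapped in a small helper; a sketch, assuming the sysfs layout shown above (the function name is illustrative):

```shell
# Set the number of VFs on a PF, resetting the count to 0 first if VFs
# already exist, since the kernel rejects changing a non-zero count
# directly. Takes the PF's sysfs device directory and the desired count.
set_numvfs() {
    local pf_dev="$1" count="$2"
    if [ "$(cat "$pf_dev/sriov_numvfs")" != "0" ]; then
        echo 0 > "$pf_dev/sriov_numvfs"
    fi
    echo "$count" > "$pf_dev/sriov_numvfs"
}

# Usage (as root): set_numvfs /sys/class/net/eth2/device 4
```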
Creating Scalable Functions (Optional)
This step is optional and depends on your DPL program and setup needs.
For more information, see the BlueField Scalable Function User Guide, TODO: CX9.
If you create SFs, refer to their representors in the configuration file.
Downloading Container Resources from NGC
Start by downloading and installing the ngc-cli tools.
For example:
For DPU mode, download the ARM ngc-cli tool:
Example
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_arm64.zip -O ngccli_arm64.zip
unzip ngccli_arm64.zip
For Host mode, download the appropriate ngc-cli tool for your system architecture:
Example for x86_64
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_linux.zip -O ngccli_linux.zip
unzip ngccli_linux.zip
Once the ngc-cli tool has been downloaded, use it to download the latest dpl_rt_service resources:
./ngc-cli/ngc registry resource download-version "nvidia/doca/dpl_rt_service"
This creates a directory in the format dpl_rt_service_va.b.c-docax.y.z, where a.b.c is the DPL Runtime Service version number and x.y.z is the DOCA version number.
For example: dpl_rt_service_v1.2.0-doca3.1.0.
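If a script needs to extract the two version numbers from the downloaded directory name, the naming scheme can be parsed with plain shell string operations; a sketch:

```shell
# Extract the DPL Runtime Service and DOCA versions from a resource
# directory name of the form dpl_rt_service_va.b.c-docax.y.z.
dir="dpl_rt_service_v1.2.0-doca3.1.0"
dpl_ver="${dir#dpl_rt_service_v}"   # strip the prefix -> 1.2.0-doca3.1.0
dpl_ver="${dpl_ver%%-*}"            # keep text before the first '-' -> 1.2.0
doca_ver="${dir##*doca}"            # keep text after the last 'doca' -> 3.1.0
echo "DPL $dpl_ver, DOCA $doca_ver"
```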
You can find available versions at NGC Catalog.
Each release includes a kubelet.d YAML file that is used by the dpl_rt_service_ctl.sh script for retrieving the correct container image for either DPU or Host mode.
Running the Preparation Script
Run the dpl_system_setup.sh script to configure the system:
For DPU mode:
cd dpl_rt_service_va.b.c-docax.y.z
chmod +x ./scripts/dpl_system_setup.sh
sudo ./scripts/dpl_system_setup.sh
sudo systemctl restart kubelet.service
sudo systemctl restart containerd.service
Warning: For DPU mode, restarting kubelet and containerd is required whenever the hugepages configuration changes for the changes to take effect.
For Host mode, specify the ConnectX device(s) that should be configured for DPL use via the --dev option (this option can be repeated):
cd dpl_rt_service_va.b.c-docax.y.z
chmod +x ./scripts/dpl_system_setup.sh
sudo ./scripts/dpl_system_setup.sh --dev 0000:08:00.0
The dpl_system_setup.sh script will perform the following:
Configures mlxconfig values:
FLEX_PARSER_PROFILE_ENABLE=4
PROG_PARSE_GRAPH=true
SRIOV_EN=1
Enables SR-IOV
Sets up the initial DPL Runtime Service configuration folder at /etc/dpl_rt_service/
Configures hugepages
Please note that the dpl_system_setup.sh script takes optional arguments to control the hugepages.
For DPU mode, if you configure more than 4GB of hugepages, you must also raise the spec->resources->limits->hugepages-2Mi limit in dpl_rt_service.yaml.
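For illustration, the fragment of dpl_rt_service.yaml you would edit follows the spec->resources->limits->hugepages-2Mi path; the surrounding structure and the value shown are assumptions, so check the YAML shipped with your release:

```
spec:
  resources:
    limits:
      hugepages-2Mi: 4Gi   # raise this to match your hugepages allocation
```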
Editing the Configuration Files
Create the device configuration file(s) based on the provided template config file.
See DPL Service Configuration for details.
For example:
Example
sudo cp /etc/dpl_rt_service/devices.d/NAME.conf.template /etc/dpl_rt_service/devices.d/1000.conf
# Then update /etc/dpl_rt_service/devices.d/1000.conf as needed.
sudo vim /etc/dpl_rt_service/devices.d/1000.conf
You must create at least one device configuration file.
Otherwise, the DPL Runtime Service Container will not be able to start.
Firewall configuration to open gRPC server ports
The DPL Runtime Service runs several gRPC servers, each listening on a dedicated TCP port and supporting a corresponding DPL Developer tool.
It is critical to make sure that these ports are accessible from the system(s) where you plan to run the DPL Developer tools.
Those tools will connect to the DPL Runtime Service using the corresponding tool's gRPC server TCP port.
The needed ports are configurable (see the server_tcp_port settings at DPL Service Configuration). By default they have the following values:
gRPC server | TCP Port |
P4 Runtime | 9559 |
DPL Admin | 9600 |
DPL Nspect / Debugger | 9560 |
Example for allowing the ports on RHEL-9
sudo firewall-cmd --permanent --add-port=9559/tcp
sudo firewall-cmd --permanent --add-port=9600/tcp
sudo firewall-cmd --permanent --add-port=9560/tcp
sudo firewall-cmd --reload
# List configurations to confirm the ports were allowed.
sudo firewall-cmd --list-all
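From the machine where the DPL Developer tools will run, you can verify reachability before connecting. A minimal sketch using bash's built-in /dev/tcp pseudo-device (nc -z <host> <port> is an equivalent alternative; the function name is illustrative):

```shell
# Return success if a TCP connection to host:port can be established.
# The connection is opened on fd 3 inside a subshell, which closes the
# socket automatically on exit.
check_port() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Usage: check_port <dpu-mgmt-ip> 9559 && echo "P4 Runtime port reachable"
```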
Starting the DPL Runtime Service Container
Once your configuration files are ready, use the dpl_rt_service_ctl.sh script to start the container:
Before running the script for the first time, you must grant it execution rights:
sudo chmod +x ./scripts/dpl_rt_service_ctl.sh
sudo ./scripts/dpl_rt_service_ctl.sh --start
For DPU mode, the script copies the YAML file into the /etc/kubelet.d/ directory, which triggers automatic creation and start of the DPL RT Service Pod and container.
For Host mode, the script will start a Docker container named dpl-rt-service.
Allow a few minutes for the container to start. To monitor status:
For DPU mode:
Check logs:
sudo journalctl -u kubelet --since -5m
List images:
sudo crictl images
List pods:
sudo crictl pods
For Host mode:
Check logs:
sudo docker logs dpl-rt-service
List images:
sudo docker images
View runtime logs:
/var/log/doca/dpl_rt_service/dpl_rtd.log
If the container fails to start due to configuration errors, the log file at /var/log/doca/dpl_rt_service/dpl_rtd.log might be empty or missing the relevant error messages.
In such a case, view the errors using the relevant tool:
For DPU:
sudo crictl logs $(sudo crictl ps -a | grep dpl-rt-service | awk '{print $1}')
For Host:
sudo docker logs dpl-rt-service
Stopping the DPL Runtime Service Container
Stop the container by using the dpl_rt_service_ctl.sh script:
sudo ./scripts/dpl_rt_service_ctl.sh --stop
For DPU mode, the script will remove the YAML file from the /etc/kubelet.d/ directory.
For Host mode, the script will stop the Docker container named dpl-rt-service.
To confirm the pod is gone (this might take a few seconds to complete):
# For DPU:
sudo crictl pods | grep dpl-rt-service
# For Host:
sudo docker ps | grep dpl-rt-service
Restarting the DPL Runtime Service After Configuration Changes
Once the DPL Runtime Service container is up and running, any change to any file under the /etc/dpl_rt_service/ configuration folder requires restarting the container in order for the new changes to take effect.
Perform the following steps to restart the container:
Stop the container:
sudo ./scripts/dpl_rt_service_ctl.sh --stop
Wait for the container to stop.
Start the container:
sudo ./scripts/dpl_rt_service_ctl.sh --start
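The "wait for the container to stop" step can be automated by polling; a sketch (the helper name is illustrative, and the grep target matches the pod name used above):

```shell
# Poll a command once per second until it succeeds or the timeout (in
# seconds) expires. Returns non-zero on timeout.
wait_for() {
    local timeout="$1"; shift
    local elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Usage (DPU mode): wait until the pod is gone before starting again:
# sudo ./scripts/dpl_rt_service_ctl.sh --stop
# wait_for 60 sh -c '! sudo crictl pods | grep -q dpl-rt-service'
# sudo ./scripts/dpl_rt_service_ctl.sh --start
```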
End-to-End Installation Steps
Replace device IDs and filenames as appropriate for your setup.
DPU Example
# Download NGC CLI tool:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_arm64.zip -O ngccli_arm64.zip
unzip ngccli_arm64.zip
# Download the DPL Runtime Service Resources bundle:
./ngc-cli/ngc registry resource download-version "nvidia/doca/dpl_rt_service"
# Prepare DPU and restart services:
cd dpl_rt_service_va.b.c-docax.y.z
chmod +x ./scripts/dpl_system_setup.sh
sudo ./scripts/dpl_system_setup.sh
sudo systemctl restart kubelet.service
sudo systemctl restart containerd.service
# Create a device configuration file with relevant interfaces info:
sudo cp /etc/dpl_rt_service/devices.d/NAME.conf.template /etc/dpl_rt_service/devices.d/1000.conf
sudo vim /etc/dpl_rt_service/devices.d/1000.conf
# Launch the Pod and container:
sudo ./scripts/dpl_rt_service_ctl.sh --start
Host Example
# Download NGC CLI tool:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/4.5.2/files/ngccli_linux.zip -O ngccli_linux.zip
unzip ngccli_linux.zip
# Download the DPL Runtime Service Resources bundle:
./ngc-cli/ngc registry resource download-version "nvidia/doca/dpl_rt_service"
# Prepare the host system:
cd dpl_rt_service_va.b.c-docax.y.z
chmod +x ./scripts/dpl_system_setup.sh
sudo ./scripts/dpl_system_setup.sh --dev 0000:08:00.0
# Create a device configuration file with relevant interfaces info:
sudo cp /etc/dpl_rt_service/devices.d/NAME.conf.template /etc/dpl_rt_service/devices.d/1000.conf
sudo vim /etc/dpl_rt_service/devices.d/1000.conf
# Launch the Docker container:
sudo ./scripts/dpl_rt_service_ctl.sh --start
For additional troubleshooting steps and deeper explanations, refer to BlueField Container Deployment Guide.
Checkpoint | Command |
View recent kubelet logs (DPU only) | sudo journalctl -u kubelet --since -5m |
View container logs (helpful if the container fails to start) | For DPU: sudo crictl logs <container-id>; For Host: sudo docker logs dpl-rt-service |
List pulled container images | For DPU: sudo crictl images; For Host: sudo docker images |
List all created pods (DPU only) | sudo crictl pods |
List running containers | For DPU: sudo crictl ps; For Host: sudo docker ps |
View DPL service logs | /var/log/doca/dpl_rt_service/dpl_rtd.log |
Make sure the following conditions are met before or during deployment:
VFs were created before deploying the container (if using SR-IOV)
All required configuration files exist under /etc/dpl_rt_service/, are correctly named, and include valid device IDs
Network interface names and MTU settings match the physical and virtual network topology
Firmware is up to date and matches DOCA compatibility requirements
For DPU mode, BlueField is operating in the correct mode (DPU mode); verify with sudo mlxconfig -d <pci-device> q
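Parts of this checklist can be scripted; a minimal sketch that only checks for the presence of device configuration files (the function name is illustrative, and the default directory mirrors the configuration folder used earlier):

```shell
# Succeed only if at least one .conf file exists in the devices.d
# directory (defaults to the path used by the DPL Runtime Service).
preflight_config() {
    local dir="${1:-/etc/dpl_rt_service/devices.d}"
    ls "$dir"/*.conf >/dev/null 2>&1
}

# Usage: preflight_config || echo "no device configuration files found"
```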