Firefly
The DOCA Firefly Service delivers Precision Time Protocol (PTP) based time synchronization for the BlueField DPU.
PTP is a protocol designed to synchronize clocks within a network. When paired with hardware support, PTP can achieve sub-microsecond accuracy, significantly surpassing the typical precision of Network Time Protocol (NTP). PTP functionality is managed across both kernel and user space, with the ptp4l program handling PTP boundary and ordinary clocks. With hardware time stamping, ptp4l synchronizes the PTP hardware clock to the master clock.
For further details about DOCA Firefly, please consult the relevant NVIDIA DOCA Firefly Service Guide.
Firmware Settings
Query status of the Real Time Clock (RTC)
To check the status of the RTC on the DPU, use the following command:
$ sudo mlxconfig -d 03:00.0 q | grep REAL_TIME_CLOCK_ENABLE
# Example output
REAL_TIME_CLOCK_ENABLE True(1)
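For scripted checks, the query can be wrapped in a small helper. The following is a minimal sketch, assuming the query output uses the two-column layout shown in the example output; the function name rtc_enabled is hypothetical and not part of any NVIDIA tool.

```shell
# Hypothetical helper: reads `mlxconfig ... q` output on stdin and succeeds
# only if REAL_TIME_CLOCK_ENABLE is reported as True(1).
rtc_enabled() {
    awk '$1 == "REAL_TIME_CLOCK_ENABLE" { found = ($2 == "True(1)") }
         END { exit(found ? 0 : 1) }'
}

# Usage on the DPU (device address taken from the example above):
#   sudo mlxconfig -d 03:00.0 q | rtc_enabled && echo "RTC is enabled"
```

The helper only reads text from standard input, so it can also be run against saved command output.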
Enabling the Real Time Clock (RTC)
To enable RTC, run:
$ sudo mlxconfig -d 03:00.0 set REAL_TIME_CLOCK_ENABLE=1
A graceful shutdown and power cycle of the DPU are required for the changes to take effect.
Open vSwitch (OVS) Configuration
Check Hardware Offload Support
To verify if hardware offload is enabled, run:
$ sudo ovs-vsctl get Open_vSwitch . other_config | grep hw-offload
# Example output
{hw-offload="true"}
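The same check can be scripted. This is a minimal sketch, assuming the other_config value is printed in the form shown in the example output; the function name hw_offload_enabled is hypothetical.

```shell
# Hypothetical helper: reads the other_config value on stdin and succeeds
# only if it contains hw-offload="true".
hw_offload_enabled() {
    grep -q 'hw-offload="true"'
}

# Usage on the DPU:
#   sudo ovs-vsctl get Open_vSwitch . other_config | hw_offload_enabled \
#       && echo "hardware offload is enabled"
```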
Enable Hardware Offload Support
To enable hardware offload, run:
$ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
Restart the OVS service:
$ sudo /etc/init.d/openvswitch-switch restart
Examine Switch Settings
To examine the current switch settings, run:
$ sudo ovs-vsctl show
    Bridge uplink
        Port pf0hpf
            Interface pf0hpf
        Port en3f0pf0sf4
            Interface en3f0pf0sf4
        Port p0
            Interface p0
        Port uplink
            Interface uplink
                type: internal
Add New Bridge / Port
To add a new bridge or port, run:
$ sudo ovs-vsctl add-br <bridge name>
$ sudo ovs-vsctl add-port <bridge name> <port name>
Example for DOCA Firefly Deployment
$ sudo ovs-vsctl add-br uplink
$ sudo ovs-vsctl add-port uplink p0
$ sudo ovs-vsctl add-port uplink en3f0pf0sf4
# This port is needed to ensure we have traffic host<->network as well
$ sudo ovs-vsctl add-port uplink pf0hpf
Network Interface Configuration
Enable Hardware Tx Port Timestamping
$ sudo ethtool --set-priv-flags enp3s0f0s4 tx_port_ts on
Configure IP Address for Interface
$ sudo ifconfig enp3s0f0s4 <ip-addr> up
Container Runtime Commands
When deploying a new container, it is recommended to follow this procedure to ensure the successful execution of each step throughout the deployment process:
View Currently Active Pods and their IDs
$ sudo crictl pods
It may take up to 20 seconds for the pod to start.
When deploying a new container, look for a corresponding entry line in the command's output:
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
06bd84c07537e 4 seconds ago Ready doca-firefly-my-dpu default 0 (default)
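Because the pod may take up to 20 seconds to start, it can be convenient to poll for its state from a script. The following is a minimal sketch, assuming the column layout shown in the output above; the function name pod_state is hypothetical.

```shell
# Hypothetical helper: reads `crictl pods` output on stdin and prints the
# STATE of the first pod whose NAME matches $1. Columns are counted from the
# right because the CREATED column ("4 seconds ago") contains spaces.
pod_state() {
    awk -v name="$1" 'NR > 1 && NF >= 5 && $(NF-3) == name { print $(NF-4); exit }'
}

# Usage on the DPU, polling once per second for up to 20 seconds:
#   for i in $(seq 1 20); do
#       [ "$(sudo crictl pods | pod_state doca-firefly-my-dpu)" = "Ready" ] && break
#       sleep 1
#   done
```

If the helper prints nothing, no pod with that name is listed.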
Review Kubelet Logs
If no matching line appears, it is recommended to check the Kubelet logs for more details about the error:
$ sudo journalctl -u kubelet --since -5m
Once the issue is resolved, proceed to the next steps.
Verify Download of Container Image from NGC
Verify that the container image is successfully downloaded from NGC into the DPU's container registry (download time may vary based on the size of the container image):
$ sudo crictl images
Example output:
IMAGE TAG IMAGE ID SIZE
k8s.gcr.io/pause 3.2 2a060e2e7101d 251kB
nvcr.io/nvidia/doca/doca_firefly 1.1.0-doca2.0.2 134cb22f34611 87.4MB
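To confirm from a script that a specific image and tag have been pulled, the listing can be filtered. This is a minimal sketch, assuming the IMAGE and TAG values are the first two columns as in the example output; the function name image_present is hypothetical.

```shell
# Hypothetical helper: reads `crictl images` output on stdin and succeeds if
# a row with repository $1 and tag $2 is listed. Appending "" forces a string
# comparison so that tags such as 3.2 and 3.20 are not treated as equal numbers.
image_present() {
    awk -v repo="$1" -v tag="$2" \
        '($1 "") == (repo "") && ($2 "") == (tag "") { found = 1 }
         END { exit(found ? 0 : 1) }'
}

# Usage on the DPU:
#   sudo crictl images | image_present nvcr.io/nvidia/doca/doca_firefly 1.1.0-doca2.0.2
```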
View Currently Active Containers
View currently active containers and their IDs:
$ sudo crictl ps
Once again, find a corresponding entry line for the deployed container (boot time may vary depending on the container's image size):
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
b505a05b7dc23 134cb22f34611 4 minutes ago Running doca-firefly 0 06bd84c07537e doca-firefly-my-dpu
In case of failure to find a matching container, review the list of all recent container deployments:
$ sudo crictl ps -a
It is possible that the container encountered an error during boot and exited right away:
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
de2361ec15b61 134cb22f34611 1 second ago Exited doca-firefly 1 4aea5f5adc91d doca-firefly-my-dpu
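Finding the exited container by eye can be tedious when several deployments were attempted. The following is a minimal sketch that extracts the relevant container IDs, assuming the column layout shown in the output above; the function name exited_containers is hypothetical.

```shell
# Hypothetical helper: reads `crictl ps -a` output on stdin and prints the ID
# of every container named $1 whose STATE is Exited. Columns are counted from
# the right because the CREATED column ("1 second ago") contains spaces.
exited_containers() {
    awk -v name="$1" \
        'NR > 1 && NF >= 5 && $(NF-3) == name && $(NF-4) == "Exited" { print $1 }'
}

# Usage on the DPU:
#   sudo crictl ps -a | exited_containers doca-firefly
```

Each printed ID can then be passed to crictl logs while the logs are still available.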
Review Logs of a Container
During the container's runtime, and for a short time after it exits, you can view the container's logs that were printed to the standard output:
$ sudo crictl logs <container-id>
In this case, the user can learn from the log that the wrong configuration was passed to the container:
$ sudo crictl logs de2361ec15b61
Starting DOCA Firefly - Version 1.1.0
...
Requested the following PTP interface: p10
Failed to find interface "p10". Aborting
For additional information and guides on using crictl, refer to the "Debugging Kubernetes Nodes with crictl" guide.
Stop a Running Container
The recommended way to stop a pod and its containers is as follows:
Delete the .yaml configuration file for Kubelet to stop the pod:
$ rm /etc/kubelet.d/<file name>.yaml
Stop the pod directly (only if it still shows "Ready"):
$ sudo crictl stopp <pod-id>
Once the pod stops, it may also be necessary to stop the container itself:
$ sudo crictl stop <container-id>
DOCA Firefly generates multiple log files, each corresponding to a specific module:
Runtime (Administrator) Logs
Main container log:
/var/log/doca/firefly/firefly.log
ptp4l log:
/var/log/doca/firefly/ptp4l.log
phc2sys log:
/var/log/doca/firefly/phc2sys.log
SyncE log:
/var/log/doca/firefly/synced.log
Developer Logs
Firefly (PTP) Monitor log:
/var/log/doca/firefly/firefly_monitor_dev.log
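When it is unclear which module reported a problem, all log files can be scanned at once. This is a minimal sketch, assuming the log lines follow the "<timestamp> - Firefly - <module> - <LEVEL> - <message>" layout seen in the Firefly log excerpts in this guide (the ptp4l and phc2sys logs may use a different layout); the function name firefly_log_issues is hypothetical.

```shell
# Hypothetical helper: prints every ERROR or WARNING line from the *.log files
# in the given directory (default: Firefly's log directory), prefixed with the
# name of the file it came from.
firefly_log_issues() {
    grep -H -E ' - (ERROR|WARNING) - ' "${1:-/var/log/doca/firefly}"/*.log
}

# Usage on the DPU:
#   firefly_log_issues
```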
DOCA Firefly operates as a containerized DOCA Service and does not require separate packages for installation.
Nonetheless, the service offers enhanced debugging capabilities for the finalized configuration file. Detailed instructions on how to utilize these debugging features are provided in the relevant section of the NVIDIA DOCA Firefly Service Guide.
When troubleshooting container deployment issues, it is highly recommended to follow the deployment steps and tips in the "Review Container Deployment" section of the NVIDIA DOCA Container Deployment Guide.
Debugging config File
To debug the finalized configuration file used by Firefly, users can connect to the container as follows:
Open a shell session on the running container using the container ID:
$ sudo crictl exec -it <container-id> /bin/bash
Once connected to the container, the finalized configuration file can be found under the /tmp directory using the same filename as the original configuration file.
More information regarding the configuration files can be found under section "Ensuring and Debugging Correctness of Config File" in the service guide.
Pod is Marked as "Ready" and No Container is Listed
Error
When deploying the container, the pod's STATE is marked as Ready and an image is listed, but no container can be seen running:
$ sudo crictl pods
POD ID CREATED STATE NAME NAMESPACE ATTEMPT RUNTIME
06bd84c07537e 4 seconds ago Ready doca-firefly-my-dpu default 0 (default)
$ sudo crictl images
IMAGE TAG IMAGE ID SIZE
k8s.gcr.io/pause 3.2 2a060e2e7101d 251kB
nvcr.io/nvidia/doca/doca_firefly 1.1.0-doca2.0.2 134cb22f34611 87.4MB
$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
Solution
In most cases, the container did start but exited immediately. This can be checked using the following command:
$ sudo crictl ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
556bb78281e1d 134cb22f34611 7 seconds ago Exited doca-firefly 1 06bd84c07537e doca-firefly-my-dpu
Should the container fail (i.e., a state of Exited), it is recommended to examine Firefly's main log at /var/log/doca/firefly/firefly.log.
In addition, for a short period of time after termination, the container logs can also be viewed using the container's ID:
$ sudo crictl logs 556bb78281e1d
Starting DOCA Firefly - Version 1.1.0
...
Requested the following PTP interface: p10
Failed to find interface "p10". Aborting
Custom Config File is Not Found
Error
When DOCA Firefly is deployed using a custom configuration file, a deployment error occurs and the following log message appears:
...
2023-09-07 14:04:23 - Firefly - Init - ERROR - Custom config file not found: my_file.conf. Aborting
...
Solution
Check the custom file name written in the YAML file and make sure that a file with that exact name is placed under the /etc/firefly/ directory of the DPU.
Profile is Not Supported
Error
When DOCA Firefly is deployed, a deployment error occurs and the following log message appears:
...
2023-09-07 14:04:23 - Firefly - Init - ERROR - profile <name> is not supported. Aborting
...
Solution
Verify that the profile selected in the YAML file matches one of the supported profiles as listed in the profiles table.
The profile name is case sensitive. The name must be specified in lower-case letters.
PPS Capability is Missing
Error
When DOCA Firefly is deployed and configured to use the PPS module, a deployment error occurs and the following log message appears:
...
2023-09-07 14:04:23 - Firefly - Init - INFO - Starting PPS configuration
2023-09-07 14:04:23 - Firefly - Init - WARNING - [-] PPS capability is missing, seems that the card doesn't support PPS
2023-09-07 14:04:23 - Firefly - Init - INFO - capabilities:
2023-09-07 14:04:23 - Firefly - Init - INFO - 50000000 maximum frequency adjustment (ppb)
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 programmable alarms
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 external time stamp channels
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 programmable periodic signals
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 pulse per second
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 programmable pins
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 cross timestamping
...
Solution
This log indicates that the DPU hardware does not support PPS. However, PTP can still run on this hardware, and you should see the line "Running ptp4l" in the container log, indicating that PTP is running successfully.
Timed Out While Polling for Tx Timestamp
Error
When the BlueField is operating in DPU mode, DOCA Firefly gets stuck in a fault loop while waiting to receive the Tx timestamp events:
ptp4l[2912.797]: timed out while polling for tx timestamp
ptp4l[2912.797]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[2912.797]: port 1 (enp3s0f0s4): send sync failed
ptp4l[2923.528]: timed out while polling for tx timestamp
ptp4l[2923.528]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[2923.528]: port 1 (enp3s0f0s4): send sync failed
DOCA Firefly has a known gap leading to this error appearing once, after which ptp4l recovers from it. This section only covers the case in which there is a fault loop and no recovery occurs.
Solution
DOCA Firefly's configuration has already been adjusted to accommodate Tx port timestamping. For more information about the cause of this error and the recovery mechanism designed for it, refer to the "Tx Timestamping Support on DPU Mode" section in the NVIDIA DOCA Firefly Service Guide.
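To tell the single known occurrence apart from a fault loop, it can help to count how often the timeout is reported. This is a minimal sketch, assuming the message text shown in the log excerpt above; the function name tx_timeout_count is hypothetical.

```shell
# Hypothetical helper: reads ptp4l log lines on stdin and prints how many
# times the Tx timestamp timeout was reported. `|| true` keeps the exit
# status at 0 when grep finds no matches (it still prints 0).
tx_timeout_count() {
    grep -c 'timed out while polling for tx timestamp' || true
}

# Usage on the DPU:
#   tx_timeout_count < /var/log/doca/firefly/ptp4l.log
```

A count of 1 is consistent with the single occurrence from which ptp4l recovers. A count that keeps growing between runs suggests the fault loop covered in this section.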
Warning - Time Jumped Backwards
Error
When using Firefly's Servo module, the following warning log message is encountered on start:
2024-01-01 14:04:23 - Firefly - SERVO - WARNING - Clock is going to jump backwards in time - this might have a system-wide impact
Solution
This warning message indicates that the system's time jumped backwards by at least one minute. Firefly logs this event because such jumps might have system-wide implications. For more information, refer to the NVIDIA DOCA Troubleshooting Guide.
Such jumps can only happen during Firefly's boot, before the Servo achieves an initial time synchronization with the reference clock.