NVIDIA BlueField Platform Software Troubleshooting Guide

Firefly

The DOCA Firefly Service delivers Precision Time Protocol (PTP) based time synchronization for the BlueField DPU.

PTP is a protocol designed to synchronize clocks within a network. When paired with hardware support, PTP can achieve sub-microsecond accuracy, significantly surpassing the typical precision of Network Time Protocol (NTP). PTP functionality is managed across both kernel and user space, with the ptp4l program handling PTP boundary and ordinary clocks. With hardware time stamping, ptp4l ensures synchronization of the PTP hardware clock to the master clock.

For further details about DOCA Firefly, please consult the relevant NVIDIA DOCA Firefly Service Guide.

Firmware Settings

Query status of the Real Time Clock (RTC)

To check the status of the RTC on the DPU, use the following command:

$ sudo mlxconfig -d 03:00.0 q | grep REAL_TIME_CLOCK_ENABLE
# Example output
REAL_TIME_CLOCK_ENABLE True(1)
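For scripted checks, the query output can be tested instead of read by eye. The following is a minimal sketch, assuming the output keeps the "name, then value" layout of the example above. The function name rtc_enabled is hypothetical and not part of any NVIDIA tool.

```shell
# Hypothetical helper: reads `mlxconfig ... q` output on stdin and succeeds
# (exit status 0) only if REAL_TIME_CLOCK_ENABLE is reported as True(1).
# Assumes the "NAME VALUE" line layout shown in the example output.
rtc_enabled() {
    awk '$1 == "REAL_TIME_CLOCK_ENABLE" && $2 == "True(1)" { ok = 1 }
         END { exit !ok }'
}

# Usage on the DPU (device address taken from the example command):
#   sudo mlxconfig -d 03:00.0 q | rtc_enabled && echo "RTC is enabled"
```

Because the helper only reads standard input, it can be tried against saved query output before it is used on a live system.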


Enabling the Real Time Clock (RTC)

To enable RTC, run:

$ sudo mlxconfig -d 03:00.0 set REAL_TIME_CLOCK_ENABLE=1

A graceful shutdown and power cycle of the DPU are required for the changes to take effect.

Open vSwitch (OVS) Configuration

Check Hardware Offload Support

To verify whether hardware offload is enabled, run:

$ sudo ovs-vsctl get Open_vSwitch . other_config | grep hw-offload
# Example output
{hw-offload="true"}


Enable Hardware Offload Support

  1. To activate hardware offloading, run:

    $ sudo ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

  2. Restart the OVS service:

    $ sudo /etc/init.d/openvswitch-switch restart

Examine Switch Settings

To examine the current switch settings, run:

$ sudo ovs-vsctl show
    Bridge uplink
        Port pf0hpf
            Interface pf0hpf
        Port en3f0pf0sf4
            Interface en3f0pf0sf4
        Port p0
            Interface p0
        Port uplink
            Interface uplink
                type: internal


Add New Bridge / Port

To add a new bridge or port, run:

$ sudo ovs-vsctl add-br <bridge name>
$ sudo ovs-vsctl add-port <bridge name> <port name>

Example for DOCA Firefly Deployment

$ sudo ovs-vsctl add-br uplink
$ sudo ovs-vsctl add-port uplink p0
$ sudo ovs-vsctl add-port uplink en3f0pf0sf4
# This port is needed to ensure we have traffic host<->network as well
$ sudo ovs-vsctl add-port uplink pf0hpf

Network Interface Configuration

Enable Hardware Tx Port Timestamping

$ sudo ethtool --set-priv-flags enp3s0f0s4 tx_port_ts on
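
To confirm that the flag took effect, the private flags can be read back with ethtool --show-priv-flags. The sketch below assumes the usual "flag name, colon, on/off" line layout of that output. The helper name priv_flag_on is hypothetical.

```shell
# Hypothetical helper: reads `ethtool --show-priv-flags <iface>` output on
# stdin and succeeds only if the flag named in "$1" is reported as "on".
# Assumes lines of the form "flag_name   : on" or "flag_name   : off".
priv_flag_on() {
    awk -v flag="$1" '$1 == flag && $NF == "on" { ok = 1 } END { exit !ok }'
}

# Usage on the DPU (interface name taken from the command above):
#   sudo ethtool --show-priv-flags enp3s0f0s4 | priv_flag_on tx_port_ts
```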


Configure IP Address for Interface

$ sudo ifconfig enp3s0f0s4 <ip-addr> up

Container Runtime Commands

When deploying a new container, it is recommended to follow this procedure to ensure the successful execution of each step throughout the deployment process:

View Currently Active Pods and their IDs

$ sudo crictl pods

Info

It may take up to 20 seconds for the pod to start.

When deploying a new container, look for a corresponding entry line in the command's output:

POD ID          CREATED         STATE   NAME                  NAMESPACE   ATTEMPT   RUNTIME
06bd84c07537e   4 seconds ago   Ready   doca-firefly-my-dpu   default     0         (default)
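
Since the pod may need up to 20 seconds to start, a script can poll the pod list rather than check it once. The following is a minimal sketch of such a wait loop. The function name wait_for_pod and the PODS_CMD variable are hypothetical and exist only so the listing command can be swapped out when trying the sketch.

```shell
# Hypothetical helper: polls the pod list once per second until a line in
# the Ready state mentions the pod name given in "$1", or until the
# timeout in "$2" (default: 20 seconds) has elapsed.
# PODS_CMD is an assumption of this sketch; by default `sudo crictl pods`
# is used as the listing command.
wait_for_pod() {
    name=$1
    timeout=${2:-20}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if ${PODS_CMD:-sudo crictl pods} |
            awk -v n="$name" '/ Ready / && index($0, n) { ok = 1 } END { exit !ok }'
        then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Usage on the DPU: fall back to the Kubelet logs if the pod never appears.
#   wait_for_pod doca-firefly 20 || sudo journalctl -u kubelet --since -5m
```

The name is matched as a substring of the whole line because the CREATED column contains spaces, which makes fixed column positions unreliable.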


Review Kubelet Logs

If no matching line appears, it is recommended to check the Kubelet logs for more details about the error:

$ sudo journalctl -u kubelet --since -5m

Once the issue is resolved, proceed to the next steps.

Verify Download of Container Image from NGC

Verify that the container image is successfully downloaded from NGC into the DPU's container registry (download time may vary based on the size of the container image):

$ sudo crictl images

Example output:

IMAGE                              TAG               IMAGE ID        SIZE
k8s.gcr.io/pause                   3.2               2a060e2e7101d   251kB
nvcr.io/nvidia/doca/doca_firefly   1.1.0-doca2.0.2   134cb22f34611   87.4MB


View Currently Active Containers

View currently active containers and their IDs:

$ sudo crictl ps

Once again, find a corresponding entry line for the deployed container (boot time may vary depending on the container's image size):

CONTAINER       IMAGE           CREATED         STATE     NAME           ATTEMPT   POD ID          POD
b505a05b7dc23   134cb22f34611   4 minutes ago   Running   doca-firefly   0         06bd84c07537e   doca-firefly-my-dpu

In case of failure to find a matching container, review the list of all recent container deployments:

$ sudo crictl ps -a

It is possible that the container encountered an error during boot and exited right away:

CONTAINER       IMAGE           CREATED        STATE    NAME           ATTEMPT   POD ID          POD
de2361ec15b61   134cb22f34611   1 second ago   Exited   doca-firefly   1         4aea5f5adc91d   doca-firefly-my-dpu


Review Logs of a Container

During the container's runtime, and for a short time after it exits, you can view the container's logs that were printed to the standard output:

$ sudo crictl logs <container-id>

In this case, the user can learn from the log that the wrong configuration was passed to the container:

$ sudo crictl logs de2361ec15b61
Starting DOCA Firefly - Version 1.1.0
...
Requested the following PTP interface: p10
Failed to find interface "p10". Aborting
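
Because the logs of an exited container are only available for a short time, it can help to collect them in one step. The sketch below extracts the IDs of exited containers from the crictl ps -a listing so they can be passed to crictl logs. The helper name exited_container_ids is hypothetical.

```shell
# Hypothetical helper: reads `crictl ps -a` output on stdin and prints the
# ID (first column) of every container in the Exited state whose line
# mentions the name given in "$1". The header row is skipped.
exited_container_ids() {
    awk -v name="$1" 'NR > 1 && / Exited / && index($0, name) { print $1 }'
}

# Usage on the DPU: print the logs of each exited doca-firefly container.
#   sudo crictl ps -a | exited_container_ids doca-firefly |
#       while read -r id; do sudo crictl logs "$id"; done
```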

Info

For additional information and guides on using crictl, refer to "Debugging Kubernetes Nodes with crictl" in the Kubernetes documentation.


Stop a Running Container

The recommended way to stop a pod and its containers is as follows:

  1. Delete the .yaml configuration file for Kubelet to stop the pod:

    $ rm /etc/kubelet.d/<file name>.yaml

  2. Stop the pod directly (only if it still shows "Ready"):

    $ sudo crictl stopp <pod-id>

  3. Once the pod stops, it may also be necessary to stop the container itself:

    $ sudo crictl stop <container-id>

DOCA Firefly generates multiple log files, each corresponding to a specific module:

Runtime (Administrator) Logs

  • Main container log: /var/log/doca/firefly/firefly.log

  • ptp4l log: /var/log/doca/firefly/ptp4l.log

  • phc2sys log: /var/log/doca/firefly/phc2sys.log

  • SyncE log: /var/log/doca/firefly/synced.log

Developer Logs

  • Firefly (PTP) Monitor log: /var/log/doca/firefly/firefly_monitor_dev.log
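
When scanning the main container log, it is often enough to look at the warning and error lines first. The following is a minimal sketch, assuming the "date time - Firefly - module - LEVEL - message" layout seen in the log excerpts in this guide. The helper name firefly_log_issues is hypothetical, and the other module logs (for example, ptp4l.log) may use a different layout.

```shell
# Hypothetical helper: prints the WARNING and ERROR lines of a Firefly log.
# Assumes the "<date> <time> - Firefly - <module> - <LEVEL> - <message>"
# layout. The path defaults to the main container log.
firefly_log_issues() {
    grep -E ' - (WARNING|ERROR) - ' "${1:-/var/log/doca/firefly/firefly.log}"
}

# Usage on the DPU:
#   firefly_log_issues
#   firefly_log_issues /var/log/doca/firefly/firefly.log | tail -n 20
```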

DOCA Firefly operates as a containerized DOCA Service and does not require separate packages for installation.

Nonetheless, the service offers enhanced debugging capabilities for the finalized configuration file. Detailed instructions on how to utilize these debugging features are provided in the relevant section of the NVIDIA DOCA Firefly Service Guide.

When troubleshooting container deployment issues, it is highly recommended to follow the deployment steps and tips in the "Review Container Deployment" section of the NVIDIA DOCA Container Deployment Guide.

Debugging config File

To debug the finalized configuration file used by Firefly, users can connect to the container as follows:

  1. Open a shell session on the running container using the container ID:

    $ sudo crictl exec -it <container-id> /bin/bash

  2. Once connected to the container, the finalized configuration file can be found under the /tmp directory using the same filename as the original configuration file.

    Info

    More information regarding the configuration files can be found under section "Ensuring and Debugging Correctness of Config File" in the service guide.

Pod is Marked as "Ready" and No Container is Listed

Error

When deploying the container, the pod's STATE is marked as Ready and an image is listed, but no container can be seen running:

$ sudo crictl pods
POD ID          CREATED         STATE   NAME                  NAMESPACE   ATTEMPT   RUNTIME
06bd84c07537e   4 seconds ago   Ready   doca-firefly-my-dpu   default     0         (default)

$ sudo crictl images
IMAGE                              TAG               IMAGE ID        SIZE
k8s.gcr.io/pause                   3.2               2a060e2e7101d   251kB
nvcr.io/nvidia/doca/doca_firefly   1.1.0-doca2.0.2   134cb22f34611   87.4MB

$ sudo crictl ps
CONTAINER   IMAGE   CREATED   STATE   NAME   ATTEMPT   POD ID   POD


Solution

In most cases, the container did start but exited immediately. This can be checked using the following command:

$ sudo crictl ps -a
CONTAINER       IMAGE           CREATED         STATE    NAME           ATTEMPT   POD ID          POD
556bb78281e1d   134cb22f34611   7 seconds ago   Exited   doca-firefly   1         06bd84c07537e   doca-firefly-my-dpu

Should the container fail (i.e., its state is Exited), it is recommended to examine Firefly's main log at /var/log/doca/firefly/firefly.log.

In addition, for a short period of time after termination, the container logs can also be viewed using the container's ID:

$ sudo crictl logs 556bb78281e1d
Starting DOCA Firefly - Version 1.1.0
...
Requested the following PTP interface: p10
Failed to find interface "p10". Aborting

Custom Config File is Not Found

Error

When DOCA Firefly is deployed using a custom configuration file, a deployment error occurs and the following log message appears:

...
2023-09-07 14:04:23 - Firefly - Init - ERROR - Custom config file not found: my_file.conf. Aborting
...


Solution

Check the custom file name written in the YAML file and make sure that you properly placed the file with that name under the /etc/firefly/ directory of the DPU.
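
The placement check can be sketched as a small helper that tests for the file by name. The function name custom_config_present is hypothetical, and its optional second argument (an alternative directory) exists only to make the sketch easy to try out away from the DPU.

```shell
# Hypothetical helper: succeeds if the custom configuration file named in
# "$1" exists as a regular file in the Firefly configuration directory.
# The directory defaults to /etc/firefly; the optional second argument
# overrides it and is an assumption of this sketch.
custom_config_present() {
    dir=${2:-/etc/firefly}
    if [ -f "$dir/$1" ]; then
        echo "found: $dir/$1"
    else
        echo "missing: $dir/$1" >&2
        return 1
    fi
}

# Usage on the DPU (file name taken from the log message above):
#   custom_config_present my_file.conf
```

The name passed to the helper should be copied exactly from the YAML file, since a mismatch in spelling or letter case produces the same "not found" result.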

Profile is Not Supported

Error

When DOCA Firefly is deployed, a deployment error occurs and the following log message appears:

...
2023-09-07 14:04:23 - Firefly - Init - ERROR - profile <name> is not supported. Aborting
...


Solution

Verify that the profile selected in the YAML file matches one of the supported profiles as listed in the profiles table.

Note

The profile name is case sensitive. The name must be specified in lower-case letters.

PPS Capability is Missing

Error

When DOCA Firefly is deployed and configured to use the PPS module, a deployment error occurs and the following log message appears:

...
2023-09-07 14:04:23 - Firefly - Init - INFO - Starting PPS configuration
2023-09-07 14:04:23 - Firefly - Init - WARNING - [-] PPS capability is missing, seems that the card doesn't support PPS
2023-09-07 14:04:23 - Firefly - Init - INFO - capabilities:
2023-09-07 14:04:23 - Firefly - Init - INFO - 50000000 maximum frequency adjustment (ppb)
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 programmable alarms
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 external time stamp channels
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 programmable periodic signals
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 pulse per second
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 programmable pins
2023-09-07 14:04:23 - Firefly - Init - INFO - 0 cross timestamping
...


Solution

This log indicates that the DPU hardware does not support PPS. However, PTP can still run on this hardware, and you should see the line "Running ptp4l" in the container log, indicating that PTP is running successfully.

Timed Out While Polling for Tx Timestamp

Error

When the BlueField is operating in DPU mode, DOCA Firefly gets stuck in a fault loop while waiting to receive the Tx timestamp events:

ptp4l[2912.797]: timed out while polling for tx timestamp
ptp4l[2912.797]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[2912.797]: port 1 (enp3s0f0s4): send sync failed
ptp4l[2923.528]: timed out while polling for tx timestamp
ptp4l[2923.528]: increasing tx_timestamp_timeout may correct this issue, but it is likely caused by a driver bug
ptp4l[2923.528]: port 1 (enp3s0f0s4): send sync failed

Info

DOCA Firefly has a known gap leading to this error appearing once, after which ptp4l recovers from it. This section only covers the case in which there is a fault loop and no recovery occurs.
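
To tell a single, recovered occurrence apart from a fault loop, the timeout messages in the ptp4l log can be counted. The sketch below treats more than one occurrence as a fault loop, which is an assumption made for illustration rather than a documented threshold. The helper name tx_timeout_fault_loop is hypothetical.

```shell
# Hypothetical helper: reads ptp4l log text on stdin and succeeds if the
# Tx timestamp timeout message appears more than once. Treating "more
# than one occurrence" as a fault loop is an assumption of this sketch.
tx_timeout_fault_loop() {
    # grep -c prints 0 and returns non-zero when nothing matches,
    # so the exit status is ignored and only the count is used.
    count=$(grep -c 'timed out while polling for tx timestamp' || true)
    [ "$count" -gt 1 ]
}

# Usage on the DPU (ptp4l log path as listed among the Firefly log files):
#   tx_timeout_fault_loop < /var/log/doca/firefly/ptp4l.log && echo "fault loop"
```

A count taken over the whole log file also includes occurrences from earlier runs, so it may be worth limiting the input to recent lines (for example, with tail) before counting.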


Solution

DOCA Firefly's configuration is already adjusted to accommodate Tx port timestamping. For more information about the cause of this error and the recovery mechanism designed for it, refer to the "Tx Timestamping Support on DPU Mode" section in the NVIDIA DOCA Firefly Service Guide.

Warning - Time Jumped Backwards

Error

When using Firefly's Servo module, the following warning log message is encountered on start:

2024-01-01 14:04:23 - Firefly - SERVO - WARNING - Clock is going to jump backwards in time - this might have a system-wide impact


Solution

This warning message indicates that the system's time jumped backwards by at least one minute. Firefly logs this event because such jumps might have system-wide implications. For more information, refer to the NVIDIA DOCA Troubleshooting Guide.

Such jumps can only happen during Firefly's boot, before the Servo has achieved an initial time synchronization with the reference clock.

© Copyright 2024, NVIDIA. Last updated on Nov 12, 2024.