DOCA Documentation v2.8.0

HBN Service Troubleshooting

The HBN container starts as init-sfs and should transition to doca-hbn within 2 minutes, which can be verified using crictl ps. Occasionally, it remains stuck as init-sfs.
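The transition can be watched with a small polling loop instead of rerunning crictl ps by hand. The following is a minimal sketch, not an official tool: the function name wait_for_hbn is hypothetical, and it assumes the names shown by crictl ps are exactly init-sfs and doca-hbn.

```shell
# Sketch: poll crictl until a container named doca-hbn is listed.
# Assumption: the container names are exactly "init-sfs" and "doca-hbn".
wait_for_hbn() {
    hbn_timeout="${1:-120}"    # seconds to wait (2 minutes by default)
    hbn_interval="${2:-5}"     # seconds between polls (must be greater than 0)
    hbn_waited=0
    while [ "$hbn_waited" -lt "$hbn_timeout" ]; do
        # --name filters by container name (regex); -q prints only container IDs.
        if [ -n "$(crictl ps --name '^doca-hbn$' -q 2>/dev/null)" ]; then
            echo "doca-hbn is running"
            return 0
        fi
        sleep "$hbn_interval"
        hbn_waited=$((hbn_waited + hbn_interval))
    done
    echo "container is still not doca-hbn after ${hbn_timeout}s" >&2
    return 1
}
```

Calling wait_for_hbn with no arguments polls every 5 seconds for up to 2 minutes. A non-zero return value suggests the container is stuck in init-sfs.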

This can happen if interface p0_if is missing. Run ip -br link show dev p0_if on BlueField and inside the container to check whether p0_if is present. If it is missing, make sure the firmware is upgraded to the latest version, then perform a BlueField system-level reset for the new firmware to take effect.

In general, the host can use any interface manager to manage the host interfaces belonging to BlueField. When the host uses an interface manager other than Netplan or NetworkManager, some ports may remain down after a BlueField reboot.

Apply the following workaround if interfaces stay down:

  1. Restart openibd:

    systemctl restart openibd

  2. Recreate SR-IOV interfaces if they are needed.

  3. Replay interface config. For example:

    • If using ifupdown2:

      ifreload -a 

    • If using Netplan:

      netplan apply

A common cause of a BGP session failing to establish is an MTU mismatch. Make sure the MTU is the same on all interfaces involved. For example, if BGP is failing on p0, verify that p0, p0_if_r, p0_if, and the remote peer of p0 all have the same MTU value.
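The local part of this comparison can be scripted. The following is a minimal sketch under stated assumptions: the function name check_mtu_match and the NET_SYSFS override are hypothetical, and the script relies only on the standard Linux sysfs file /sys/class/net/<interface>/mtu. It cannot see the remote peer, whose MTU must still be checked on the peer itself.

```shell
# Sketch: print the MTU of each named interface and fail if they differ.
# NET_SYSFS is a hypothetical override used only for trying the function out;
# by default the MTU is read from /sys/class/net/<interface>/mtu.
check_mtu_match() {
    mtu_first=""
    mtu_status=0
    for mtu_if in "$@"; do
        mtu_file="${NET_SYSFS:-/sys/class/net}/$mtu_if/mtu"
        if [ ! -r "$mtu_file" ]; then
            echo "$mtu_if: not found" >&2
            mtu_status=1
            continue
        fi
        mtu_val="$(cat "$mtu_file")"
        echo "$mtu_if mtu $mtu_val"
        # Remember the first MTU seen and compare every later one against it.
        [ -z "$mtu_first" ] && mtu_first="$mtu_val"
        [ "$mtu_val" = "$mtu_first" ] || mtu_status=1
    done
    return "$mtu_status"
}
```

For example, check_mtu_match p0 p0_if_r p0_if returns non-zero if any listed interface is missing or reports a different MTU. Run it wherever those interfaces are visible.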

The HBN container image version (from /etc/image-version) can be collected using the hbn-support command inside the container:

root@bf2:/tmp# hbn-support
Please send /var/support/hbn_support_doca-hbn-service-bf2-s15-1-ipmi_20240820_211214.txz to Cumulus support.

The generated dump is available under /var/support in the HBN container and contains any process core dumps and log files. Generated cores are stored under /var/support/core and are collected by hbn-support. The /var/support directory is also mounted on the BlueField Arm side at /var/lib/hbn/var/support.
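Because of this mount, the archive path printed inside the container maps directly to a path on the Arm side. The following is a minimal sketch of that mapping. The function name hbn_dump_arm_path is hypothetical, and the sketch assumes the "Please send ... to" line has the form shown in the example output above.

```shell
# Sketch: turn the archive path printed by hbn-support into the matching
# path on the BlueField Arm side. Reads hbn-support output on stdin.
hbn_dump_arm_path() {
    # Keep only the /var/support/... path from the "Please send" line and
    # prefix it with /var/lib/hbn, where the directory is mounted on the Arm side.
    sed -n 's|^Please send \(/var/support/[^ ]*\) to .*|/var/lib/hbn\1|p'
}
```

Inside the container, hbn-support | hbn_dump_arm_path prints the location from which the archive can be copied off the BlueField.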

For BlueField, the BFB version can be checked from /etc/mlnx-release.

The firmware version can be collected using mlxfwmanager.

A BlueField support dump can be collected using the sos command:

root@bf2:/tmp# sos report -a --all-logs --batch

Example output:

sos report (version 4.8.0)

This command will collect system configuration and diagnostic information from this Ubuntu system.
...
...
Finished running plugins

Creating compressed archive...

Your sos report has been generated and saved in:
 /tmp/sosreport-bf2-s15-1-ipmi-2024-08-20-cpdvegw.tar.xz

 Size   19.37MiB
 Owner  root
 sha256 0890a855623a1a2dd5089c9cd6d57d81e71f3805ac06c2d9fc0dab556ccd5ffc

Please send this file to your support representative.

To troubleshoot flows going through SFC interfaces, the first step is to disable the nl2doca service in the HBN container:

root@bf2:/tmp# supervisorctl stop nl2doca
nl2doca: stopped

Stopping nl2doca effectively stops hardware offloading and switches to software forwarding. All packets then appear in tcpdump captures on BlueField interfaces.

tcpdump can be run on SF interfaces as well as on VLAN, VXLAN, and uplink interfaces to determine where a packet gets dropped or which path a flow is taking.
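One way to narrow down the drop point is to look for the same flow on each interface in turn. The following is a minimal sketch, assuming nl2doca is stopped and the flow is continuous (for example, a running ping), since the interfaces are checked one after another. The function name seen_on and the 5-second wait are arbitrary choices made for illustration.

```shell
# Sketch: report on which interfaces a flow is visible.
# Usage: seen_on "<pcap filter>" <interface>...
seen_on() {
    seen_filter="$1"
    shift
    for seen_if in "$@"; do
        # -nn: no name resolution; -c 1: exit after the first matching packet.
        # timeout ends the capture after 5 seconds if nothing matches.
        if timeout 5 tcpdump -nn -c 1 -i "$seen_if" "$seen_filter" >/dev/null 2>&1; then
            echo "$seen_if: seen"
        else
            echo "$seen_if: not seen"
        fi
    done
}
```

For example, seen_on 'host 6.0.0.26' pf0vf0_if p0_if p1_if lists the interfaces in the order given. The first interface reporting "not seen" after one reporting "seen" is a reasonable place to start looking.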

The following steps can be used to make sure the nl2doca daemon is up and running:

  1. Make sure there are no errors in the nl2doca log file at /var/log/hbn/nl2docad.log.

  2. To check the status of the nl2doca daemon under supervisor, run:

    supervisorctl status nl2doca

  3. Use ps to check that the actual nl2doca process is running:

    ps -eaf | grep nl2doca
    root          18       1  0 06:31 ?        00:00:00 /bin/bash /usr/bin/nl2doca-docker-start
    root        1437      18  0 06:31 ?        00:05:49 /usr/sbin/nl2docad

  4. If the daemon has crashed, the core file should be in /var/support/core/.

  5. Check whether the file /cumulus/nl2docad/run/stats/punt is accessible. If it is not, nl2doca may be stuck and should be restarted:

    supervisorctl restart nl2doca
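The supervisor, process, and punt-file checks above can be combined into a single pass. The following is a minimal sketch, not an official health check: the function name nl2doca_health and the PUNT_FILE override are hypothetical, and "accessible" is interpreted here as "readable within 5 seconds". Checking the log file and core directory is left to the reader.

```shell
# Sketch: run the supervisor, process, and punt-file checks in one go.
# PUNT_FILE is a hypothetical override used only for trying the function out.
nl2doca_health() {
    nl_rc=0
    # supervisorctl reports the state RUNNING for a healthy program.
    if supervisorctl status nl2doca | grep -q RUNNING; then
        echo "supervisor: RUNNING"
    else
        echo "supervisor: not RUNNING"
        nl_rc=1
    fi
    # -x matches the exact process name of the daemon binary.
    if pgrep -x nl2docad >/dev/null; then
        echo "process: found"
    else
        echo "process: missing"
        nl_rc=1
    fi
    # Give up after 5 seconds in case the read hangs.
    if timeout 5 cat "${PUNT_FILE:-/cumulus/nl2docad/run/stats/punt}" >/dev/null 2>&1; then
        echo "punt stats: readable"
    else
        echo "punt stats: not accessible"
        nl_rc=1
    fi
    return "$nl_rc"
}
```

Run nl2doca_health inside the HBN container. A non-zero return value together with "punt stats: not accessible" corresponds to the case where a restart is suggested.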

If a certain traffic flow does not work as expected, disable nl2doca (i.e., disable hardware offloading):

supervisorctl stop nl2doca

If the traffic starts working with hardware offloading disabled, the problem is an offloading issue. If it is not an offloading issue, use tcpdump on the various interfaces to see where the packet gets dropped.

Offloaded entries can be checked in the following files, which contain the programming status of every IP prefix and MAC address known to the system.

  • Bridge entries are available in the file /cumulus/nl2docad/run/software-tables/17. This file includes all the MAC addresses in the system, both local and remote.

    Example format:

    - flow-entry: 0xaaab0cef4190
      flow-pattern:
         fid: 112
         dst mac: 00:00:5e:00:01:01
      flow-actions:
         SET VRF: 2
         OUTPUT-PD-PORT: 20(TO_RTR_INTF)
      STATS:
         pkts: 1719
         bytes: 191286

  • Router entries are available in the file /cumulus/nl2docad/run/software-tables/18. This file includes all the IP prefixes known to the system.

    Example format, with and without ECMP:

    Entry with ECMP:

    - flow-entry: 0xaaaada723700
      flow-pattern:
         IPV6: LPM
         VRF: 0
         destination-ip: ::/0
      flow-actions :
         ECMP: 2
      STATS:
         pkts: 0
         bytes: 0

    Entry without ECMP:

    - flow-entry: 0xaaaada7e1400
      flow-pattern:
         IPV4: LPM
         VRF: 0
         destination-ip: 60.1.0.93/32
      flow-actions :
         SET FID: 200
         SMAC: 00:04:4b:a7:88:00
         DMAC: 00:03:00:08:00:12
         OUTPUT-PD-PORT: 19(TO_BR_INTF)
      STATS:
         pkts: 0
         bytes: 0

  • ECMP entries are available in the file /cumulus/nl2docad/run/software-tables/19. This file includes all the next hops in the system.

    Example format:

    - ECMP: 2
      ref-count: 2
      num-next-hops: 2
      entries:
        - { index: 0, fid: 4100, src mac: 'b8:ce:f6:99:49:6a', dst mac: '00:02:00:00:00:0a' }
        - { index: 1, fid: 4101, src mac: 'b8:ce:f6:99:49:6b', dst mac: '00:02:00:00:00:0e' }
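These tables can be long, so it may help to print only the entries that mention a given MAC address or prefix. The following is a minimal sketch, assuming each entry in the bridge and router tables starts with a "- flow-entry:" line and holds one field per line. The function name flow_entries is hypothetical.

```shell
# Sketch: print every flow entry in a table file that contains a given string.
# Assumes each entry starts with a "- flow-entry:" line, one field per line.
flow_entries() {
    # $1: fixed string to look for, $2: table file (e.g. .../software-tables/17)
    awk -v want="$1" '
        # Print the collected entry if it contains the wanted string.
        function emit() { if (entry != "" && index(entry, want)) printf "%s", entry }
        /^[ \t]*- flow-entry:/ { emit(); entry = "" }
        { entry = entry $0 "\n" }
        END { emit() }
    ' "$2"
}
```

For example, flow_entries 'dst mac: 00:00:5e:00:01:01' /cumulus/nl2docad/run/software-tables/17 prints the complete bridge entries for that MAC address, including their actions and counters.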

To check counters for packets going to the kernel, run:

cat /cumulus/nl2docad/run/stats/punt
PUNT miss pkts:3154 bytes:312326
PUNT miss drop pkts:0 bytes:0
PUNT control pkts:31493 bytes:2853186
PUNT control drop pkts:0 bytes:0
ACL PUNT pkts:68 bytes:7364
ACL drop pkts:0 bytes:0
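The drop counters in this output are the ones most worth watching. The following is a minimal sketch that flags them, assuming the file holds one counter per line in the form "NAME pkts:N bytes:M" as in the output above. The function name punt_drops is hypothetical.

```shell
# Sketch: list punt counters whose name contains "drop" and whose packet
# count is non-zero. Reads the contents of the punt stats file on stdin.
punt_drops() {
    awk '
        /drop/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^pkts:/) {
                    split($i, kv, ":")
                    if (kv[2] + 0 > 0) { print; found = 1 }
                }
        }
        # Exit status 1 signals that at least one drop counter is non-zero.
        END { exit (found ? 1 : 0) }
    '
}
```

For example, punt_drops < /cumulus/nl2docad/run/stats/punt prints nothing and returns 0 when all drop counters are zero.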

For a specific type of packet flow, the programming can be checked in block-specific files.

For example, to check L2 EVPN ENCAP flows for remote MAC 8a:88:d0:b1:92:b1 on port pf0vf0_if, the basic offload flow should look as follows: RxPort (pf0vf0_if) -> BR (Overlay) -> RTR (Underlay) -> BR (Underlay) -> TxPort (one of the uplinks, p0_if or p1_if, based on the ECMP hash).

Step-by-step procedure:

  1. Navigate to the interface file /cumulus/nl2docad/run/software-tables/20.

  2. Check for the RxPort (pf0vf0_if):

    Interface: pf0vf0_if
       PD PORT: 6
       HW PORT: 16
       NETDEV PORT: 11
       Bridge-id: 61
       Untagged FID: 112

    FID 112 is assigned to the receive port.

  3. Check the bridge table file /cumulus/nl2docad/run/software-tables/17 for the entry with destination MAC 8a:88:d0:b1:92:b1 and FID 112:

    flow-pattern:
       fid: 112
       dst mac: 8a:88:d0:b1:92:b1
    flow-actions:
       VXLAN ENCAP:
          ENCAP dst ip: 6.0.0.26
          ENCAP vni id: 1000112
       SET VRF: 0
       OUTPUT-PD-PORT: 20(TO_RTR_INTF)
    STATS:
       pkts: 100
       bytes: 10200

  4. Check the router table file /cumulus/nl2docad/run/software-tables/18 for the entry with destination IP 6.0.0.26 and VRF 0:

    flow-pattern:
       IPV4: LPM
       VRF: 0
       ip dst: 6.0.0.26/32
    flow-actions :
       ECMP: 1
       OUTPUT PD PORT: 2(TO_BR_INTF)
    STATS:
       pkts: 300
       bytes: 44400

  5. Check the ECMP table file /cumulus/nl2docad/run/software-tables/19 for the entry with ECMP 1:

    - ECMP: 1
      ref-count: 7
      num-next-hops: 2
      entries:
        - { index: 0, fid: 4100, src mac: 'b8:ce:f6:99:49:6a', dst mac: '00:02:00:00:00:2f' }
        - { index: 1, fid: 4115, src mac: 'b8:ce:f6:99:49:6b', dst mac: '00:02:00:00:00:33' }

  6. The ECMP hash calculation picks one of these paths for the next-hop rewrite. Check the bridge table file for both candidates (fid=4100, dst mac: 00:02:00:00:00:2f or fid=4115, dst mac: 00:02:00:00:00:33):

    flow-pattern:
       fid: 4100
       dst mac: 00:02:00:00:00:2f
    flow-actions:
       OUTPUT-PD-PORT: 36(p0_if)
    STATS:
       pkts: 1099
       bytes: 162652

    This shows the packet going out on the uplink.
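The hand-off from the ECMP table to the bridge table in steps 5 and 6 can be partly automated by listing the (fid, dst mac) pairs of one ECMP group. The following is a minimal sketch, assuming the ECMP table holds one "- ECMP: <id>" line per group followed by one next hop per line, as in the example in step 5. The function name ecmp_next_hops is hypothetical.

```shell
# Sketch: print "fid dst-mac" for every next hop of one ECMP group.
# $1: ECMP id, $2: ECMP table file (e.g. .../software-tables/19)
ecmp_next_hops() {
    awk -v id="$1" '
        # A "- ECMP: <id>" line opens a group; track whether it is the wanted one.
        /^[ \t]*- ECMP:/ { in_group = ($3 == id) }
        in_group && /fid:/ {
            fid = ""; mac = ""
            for (i = 1; i < NF; i++) {
                if ($i == "fid:") fid = $(i + 1)
                if ($i == "dst" && $(i + 1) == "mac:") mac = $(i + 2)
            }
            # Strip the trailing comma and the single quotes (octal 047).
            gsub(/[\047,]/, "", fid)
            gsub(/[\047,]/, "", mac)
            print fid, mac
        }
    ' "$2"
}
```

With the ECMP 1 entry from step 5, ecmp_next_hops 1 /cumulus/nl2docad/run/software-tables/19 prints "4100 00:02:00:00:00:2f" and "4115 00:02:00:00:00:33", which are the pairs to look up in the bridge table.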

To check the status of the NVUE daemon, run:

supervisorctl status nvued

To restart the NVUE daemon, run:

supervisorctl restart nvued

© Copyright 2024, NVIDIA. Last updated on Aug 21, 2024.