Active-Standby Fronthaul Port Failover#

This page covers how to perform active-standby fronthaul port failover tests.

Test Configuration#

  • One active port and one redundant/standby port on the NIC:

    • Cell capacity is restricted to the bandwidth supported by one NIC port (200 Gbps).

    • If the active port fails, the L1 controller switches to the redundant/standby port. The cell should keep running during the port failover.

../../_images/active-standby-test-digram1.png

Test Procedure#

The following test procedure exercises the failover functionality in Aerial Connection Manager and verifies that the FH driver supports it. To simulate a port failure, a script on the FH switch disables and re-enables the relevant switch port. The test can be run using cuBB with the FH switch or on the actual E2E setup. For example: TestMAC <-> cuPHYController <-> FH switch (SN3750) <-> RU emulator.

  1. Configure the cuphycontroller.yaml file.

    Note

    The examples below assume the following:

    • Default port: ifname: ‘aerial00’; PCIe address: ‘0000:01:00.0’

    • Backup port: ifname: ‘aerial01’; PCIe address: ‘0000:01:00.1’

    1. Add both NIC ports in the cuphycontroller.yaml file. The first port is used as the default port for C/U-plane traffic by setting the ‘nic’ in all cells ['cuphydriver_config']['cells'][]['nic'] as ‘0000:01:00.0’.

      cuphydriver_config:
      ...
      nics:
          - nic: '0000:01:00.0'
            mtu: 1514
            cpu_mbufs: 196608
            uplane_tx_handles: 64
            txq_count: 60
            rxq_count: 20
            txq_size: 8192
            rxq_size: 16384
            gpu: 0
          - nic: '0000:01:00.1'
            mtu: 1514
            cpu_mbufs: 196608
            uplane_tx_handles: 64
            txq_count: 60
            rxq_count: 20
            txq_size: 8192
            rxq_size: 16384
            gpu: 0
      
    2. Enable the ‘cus_port_failover’ flag in the cuphycontroller.yaml file.

      cuphydriver_config:
      ...
      cus_port_failover: 1
      
    3. Set the default port and backup port in $cuBB_SDK/cuPHY-CP/cuphyoam/src/cus_conn_mgr.cpp#L46 (the line number may differ between releases). Rebuild after making the code change.

      std::string if_names[] = {"aerial00", "aerial01"}; // First one is the default port, the second one is backup port
      
  2. Start the cuphycontroller.

    The following is example output of the connection manager from the cuphycontroller console on startup:

    11:05:51.677624 WRN 56926 0 [OAM.CUSConnMgr] AerialCUSConnMgr started...
    11:05:51.677624 WRN 56926 0 [OAM.CUSConnMgr] Default port: 'aerial00',  backup port: 'aerial01'
    11:05:51.677688 WRN 56926 0 [OAM.CUSConnMgr] Interface 'aerial00' index is: 2
    11:05:51.677693 WRN 56926 0 [OAM.CUSConnMgr] Interface 'aerial01' index is: 5
    11:05:51.677693 WRN 56926 0 [OAM.CUSConnMgr] Default CUS port index is: 2
    11:05:51.678065 WRN 56926 0 [OAM.CUSConnMgr] Interface idx 2 PCIe address is: 0000:01:00.0
    11:05:51.678170 WRN 56926 0 [OAM.CUSConnMgr] Interface idx 5 PCIe address is: 0000:01:00.1
    11:05:51.678180 WRN 56926 0 [OAM.CUSConnMgr] Listening for link down events...
    11:05:51.678272 WRN 56926 0 [OAM.CUSConnMgr] Link up event detected on interface index: 2
    11:05:51.678280 WRN 56926 0 [OAM.CUSConnMgr] Link up event detected on interface index: 5
    

    Aerial Connection Manager then monitors the ports for failures (that is, link down events).

  3. Execute the FH port failover script on the FH switch. The script performs the following:

    1. MAC address remapping

    2. Applying the port active rules

    When a port failure occurs, Aerial Connection Manager does the following:

    1. Changes the FH port to the redundant/standby port.

    2. Reconfigures ptp4l to use the redundant/standby port and restarts the ptp4l daemon.

  4. Check whether the CUS-plane switch completes and determine the impact on the C/U-plane (that is, whether FH packets are dropped, early, on time, or late).

    The following is example output from the cuphycontroller console when a link down event is detected on the default port ‘aerial00’:

    11:11:03.068607 WRN 56926 0 [OAM.CUSConnMgr] Link down event detected on interface index: 2
    11:11:03.195914 WRN 56926 0 [OAM.CUSConnMgr] ptp4l service restarted successfully.
    11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] Successfully switch CUS port to interface aerial01
    11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CU Plane port failover took 109857 nanoseconds.
    11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CUS Plane port failover took 127309590 nanoseconds.
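
When the test is repeated over several iterations, it can help to pull the failover timings out of the console log automatically. The following is a minimal sketch, assuming the timing lines keep the exact wording shown in the example output above; the function name `failover_durations_ms` and the regular expression are illustrative and not part of the Aerial SDK.

```python
import re

# Matches the two timing lines from the example output, for instance:
#   "... [OAM.CUSConnMgr] CUS Plane port failover took 127309590 nanoseconds."
# The line format is an assumption based on that example only.
TIMING_RE = re.compile(
    r"\[OAM\.CUSConnMgr\] (CUS? Plane) port failover took (\d+) nanoseconds"
)


def failover_durations_ms(log_text):
    """Return {plane name: duration in milliseconds} for each timing line found."""
    durations = {}
    for line in log_text.splitlines():
        match = TIMING_RE.search(line)
        if match:
            durations[match.group(1)] = int(match.group(2)) / 1e6
    return durations


SAMPLE_LOG = """\
11:11:03.068607 WRN 56926 0 [OAM.CUSConnMgr] Link down event detected on interface index: 2
11:11:03.195914 WRN 56926 0 [OAM.CUSConnMgr] ptp4l service restarted successfully.
11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] Successfully switch CUS port to interface aerial01
11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CU Plane port failover took 109857 nanoseconds.
11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CUS Plane port failover took 127309590 nanoseconds.
"""

if __name__ == "__main__":
    for plane, ms in failover_durations_ms(SAMPLE_LOG).items():
        print("{}: {:.3f} ms".format(plane, ms))
```

For the example output, the CUS Plane figure (about 127.3 ms) agrees with the gap between the link down timestamp (11:11:03.068607) and the final log lines (11:11:03.195916).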
    

FH Switch Test Script#

The following test script is verified on the Spectrum SN3750 switch:

  • The default active NIC port is connected to the swp7 port on the switch.

  • The standby NIC port is connected to the swp8 port on the switch.

Use the test script as follows:

    sudo ./failover -port1 swp7 -port2 swp8 -iter 1 -interval 0.1 -wait 20

#!/usr/bin/python3

import sys
import os
import json
import subprocess
import argparse
import time
import datetime


def setup_arg_parser():
    cfg = argparse.ArgumentParser()
    cfg.add_argument("-port1", help="port1 (primary FH port)", required=True)
    cfg.add_argument("-port2", help="port2 (Failover FH port)", required=True)
    cfg.add_argument("-iter", help="number of ping pong iterations", required=True)
    cfg.add_argument("-interval", help="polling interval in seconds, e.g. 0.1", required=True)
    cfg.add_argument("-wait", help="wait between ping pong iterations", required=True)

    return cfg

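def get_port_state(port):
    # query the link state of a switch port through the NVUE CLI
    cmd = 'nv show interface {} link state --output json'.format(port)
    out = subprocess.check_output(cmd.split())
    j1 = json.loads(out)
    # the first key of the returned JSON object is used as the state ('up' or 'down')
    port_state = list(j1.keys())[0]

    return port_state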

def get_port_ptp_counter(port):
    #cmd = 'nv show interface {} counters  ptp --output json'.format(port)
    cmd = 'ptpctl -j show interface ethernet {} counters'.format(port)
    out = subprocess.check_output(cmd.split())
    j1 = json.loads(out)
    delay_resp_tx_cnt = j1['delay-resp']['transmitted']

    return delay_resp_tx_cnt

def prepare_ports(port1, port2):
    # get port1 and port2 link state
    port1_state = get_port_state(port1)
    port2_state = get_port_state(port2)

    # if both ports are down, bring both up, and pick port1 to be brought down
    # if both ports are up, pick port1 to be brought down
    if port1_state == port2_state:
        if port1_state == 'down':
            os.system('ip link set up {}'.format(port1))
            os.system('ip link set up {}'.format(port2))
            time.sleep(3)
        port_to_fail = port1
        failover_port = port2
    else:
        # if port1 is up and port2 is down, bring up port2 and pick port1 to be down
        # if port1 is down and port2 is up, bring up port1 and pick port2 to be down
        if port1_state == 'up':
            os.system('ip link set up {}'.format(port2))
            time.sleep(3)
            port_to_fail = port1
            failover_port = port2
        else:
            os.system('ip link set up {}'.format(port1))
            time.sleep(3)
            port_to_fail = port2
            failover_port = port1

    print('Preparing ports: failover from {} to {}'.format(port_to_fail, failover_port))
    return port_to_fail, failover_port

def start_sequence(port_to_fail, failover_port, interval):
    # ensure both ports are up
    p1_state = get_port_state(port_to_fail)
    p2_state = get_port_state(failover_port)
    if p1_state == 'down' or p2_state == 'down':
        print('Not both ports are up, abort.')
        sys.exit(1)

    # take failover_port ptp counter snapshot
    delay_resp_tx_cnt = get_port_ptp_counter(failover_port)
    new_cnt = delay_resp_tx_cnt

    # bring down port_to_fail and log in syslog
    os.system('ip link set down {}'.format(port_to_fail))
    os.system('logger \"{} is shutdown\"'.format(port_to_fail))

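    # update MAC address translation (hack): 'swp7' is hard-coded as the switch
    # port of the default NIC port, so adjust it if -port1 is not swp7
    acl_path = '/etc/cumulus/acl/policy.d/50_nvue.rules'
    if port_to_fail == 'swp7':
        os.system('cp port2_active.rules {}'.format(acl_path))
    else:
        os.system('cp port1_active.rules {}'.format(acl_path))
    # reinstall the ACL rules so that the new translation takes effect
    os.system('cl-acltool -i')
    os.system('logger \"Updated mac translation\"')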

    # start polling for failover port counter change
    print(datetime.datetime.now())
    while new_cnt == delay_resp_tx_cnt:
        time.sleep(float(interval))
        new_cnt = get_port_ptp_counter(failover_port)
    print(datetime.datetime.now())

    # log counter change to syslog
    os.system('logger \"{} ptp is locked\"'.format(failover_port))

if __name__ == '__main__':
    parser = setup_arg_parser()
    cfg = parser.parse_args(sys.argv[1:])

    port1 = cfg.port1
    port2 = cfg.port2
    for i in range(int(cfg.iter)):
        port_to_fail, failover_port = prepare_ports(port1, port2)
        msg = 'Iteration {} start, failover from {} to {}'.format(i, port_to_fail, failover_port)
        print(msg)
        os.system('logger \"{}\"'.format(msg))
        start_sequence(port_to_fail, failover_port, cfg.interval)
        print('Iteration {} finished, waiting to start next sequence'.format(i))
        time.sleep(int(cfg.wait))
        port1 = port_to_fail
        port2 = failover_port
    os.system('ip link set up {}'.format(port1))
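
The polling loop in start_sequence waits indefinitely for the PTP delay-resp counter on the failover port to change, so the script hangs if PTP never recovers. The following is a hedged sketch of a bounded variant; the helper name `wait_for_counter_change` and its `timeout` parameter are hypothetical and not part of the verified script, and `read_counter` stands in for a call such as get_port_ptp_counter(failover_port).

```python
import time


def wait_for_counter_change(read_counter, baseline, interval, timeout):
    """Poll read_counter() every `interval` seconds until it differs from
    `baseline`. Return the elapsed seconds, or None if `timeout` expires."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if read_counter() != baseline:
            return time.monotonic() - start
        time.sleep(interval)
    return None


if __name__ == "__main__":
    # Stand-in for the real counter: the value changes on the third read.
    readings = iter([10, 10, 11])
    elapsed = wait_for_counter_change(lambda: next(readings), 10, 0.01, 1.0)
    print("counter changed" if elapsed is not None else "timed out")
```

Using time.monotonic() rather than wall-clock time keeps the measurement unaffected by clock adjustments, which is relevant on a host where PTP may be stepping the system clock.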

FH switch rule changes

  • swp3: The RU emulator NIC port is connected to the switch's swp3 port.

  • 94:6d:ae:f5:ab:98 is the MAC address of the default port.

  • 94:6d:ae:f5:ab:99 is the MAC address of the standby port.

port2_active.rules

# Auto-generated by NVUE!
# Any local modifications will prevent NVUE from re-generating this file.
# md5sum: 4e39a3b53931b61cc128e47ce2ca6d2c


[iptables]


[ip6tables]


[ebtables]



## ACL ru-in in dir inbound on interface swp3 ##
# rule-id #1:  #
-t nat -A PREROUTING -i swp3 --comment rule_id:1,acl_name:ru-in,dir:inbound,interface_id:swp3 -d 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j dnat --to-destination 94:6d:ae:f5:ab:99

## ACL ru-out in dir outbound on interface swp3 ##
# rule-id #1:  #
-t nat -A POSTROUTING -o swp3 --comment rule_id:1,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
# rule-id #2:  #
-t nat -A POSTROUTING -o swp3 --comment rule_id:2,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:99/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98

port1_active.rules

# Auto-generated by NVUE!
# Any local modifications will prevent NVUE from re-generating this file.
# md5sum: 4e39a3b53931b61cc128e47ce2ca6d2c


[iptables]


[ip6tables]


[ebtables]



## ACL ru-in in dir inbound on interface swp3 ##
# rule-id #1:  #
-t nat -A PREROUTING -i swp3 --comment rule_id:1,acl_name:ru-in,dir:inbound,interface_id:swp3 -d 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j dnat --to-destination 94:6d:ae:f5:ab:98

## ACL ru-out in dir outbound on interface swp3 ##
# rule-id #1:  #
-t nat -A POSTROUTING -o swp3 --comment rule_id:1,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
# rule-id #2:  #
-t nat -A POSTROUTING -o swp3 --comment rule_id:2,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:99/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
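
Reading the two rule files side by side, the only difference is the inbound DNAT target: frames that the RU sends to the default MAC are redirected to whichever NIC port is active, while outbound frames from either NIC port are rewritten to carry the default MAC as their source. The following sketch generates the three [ebtables] rule lines for a given active port to make that difference explicit. The function name `ebtables_rules` is illustrative; the MAC addresses and the swp3 port are the example values from this page.

```python
DEFAULT_MAC = "94:6d:ae:f5:ab:98"  # default NIC port (example value from this page)
STANDBY_MAC = "94:6d:ae:f5:ab:99"  # standby NIC port (example value from this page)
MASK = "ff:ff:ff:ff:ff:ff"


def ebtables_rules(active_mac, ru_port="swp3"):
    """Build the three [ebtables] rule lines for the given active NIC MAC."""
    # Inbound: the RU always addresses the default MAC; redirect to the active port.
    inbound = (
        "-t nat -A PREROUTING -i {p} --comment rule_id:1,acl_name:ru-in,"
        "dir:inbound,interface_id:{p} -d {d}/{m} -j dnat --to-destination {a}"
    ).format(p=ru_port, d=DEFAULT_MAC, m=MASK, a=active_mac)
    # Outbound: frames from either NIC port leave with the default MAC as source.
    outbound = [
        (
            "-t nat -A POSTROUTING -o {p} --comment rule_id:{i},acl_name:ru-out,"
            "dir:outbound,interface_id:{p} -s {s}/{m} -j snat --to-source {d}"
        ).format(p=ru_port, i=i, s=src, m=MASK, d=DEFAULT_MAC)
        for i, src in enumerate([DEFAULT_MAC, STANDBY_MAC], start=1)
    ]
    return [inbound] + outbound


if __name__ == "__main__":
    print("# port2 active (standby NIC port carries traffic)")
    print("\n".join(ebtables_rules(STANDBY_MAC)))
    print("# port1 active (default NIC port carries traffic)")
    print("\n".join(ebtables_rules(DEFAULT_MAC)))
```

The sketch only produces the rule lines; the surrounding NVUE header and the empty [iptables] and [ip6tables] sections of the files are not generated.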