Active-Standby Fronthaul Port Failover
This page covers how to perform active-standby fronthaul port failover tests.
Test Configuration
One active port and one redundant/standby port on the NIC:
Cell capacity is restricted to the bandwidth supported by one NIC port (200 Gbps).
If the active port fails, the L1 controller switches to the redundant/standby port. The cell should not stop during the port failover.

Test Procedure
The following test procedure verifies the port failover functionality in Aerial Connection Manager and confirms that the FH driver supports this requirement. To simulate port failure, a script on the FH switch disables and re-enables the specific port. The test can be run using cuBB with the FH switch or on an actual E2E setup, for example: TestMAC <-> cuPHYController <-> FH switch (SN3750) <-> RU emulator.
Configure the cuphycontroller.yaml file.
Note
The examples below assume the following:
Default port: ifname: ‘aerial00’; PCIe address: ‘0000:01:00.0’
Backup port: ifname: ‘aerial01’; PCIe address: ‘0000:01:00.1’
Add both NIC ports in the cuphycontroller.yaml file. The first port is used as the default port for C/U-plane traffic by setting ‘nic’ in all cells (['cuphydriver_config']['cells'][]['nic']) to ‘0000:01:00.0’.
cuphydriver_config:
  ...
  nics:
    - nic: '0000:01:00.0'
      mtu: 1514
      cpu_mbufs: 196608
      uplane_tx_handles: 64
      txq_count: 60
      rxq_count: 20
      txq_size: 8192
      rxq_size: 16384
      gpu: 0
    - nic: '0000:01:00.1'
      mtu: 1514
      cpu_mbufs: 196608
      uplane_tx_handles: 64
      txq_count: 60
      rxq_count: 20
      txq_size: 8192
      rxq_size: 16384
      gpu: 0
Enable the ‘cus_port_failover’ flag in the cuphycontroller.yaml file.
cuphydriver_config:
  ...
  cus_port_failover: 1
Set the default port and backup port in $cuBB_SDK/cuPHY-CP/cuphyoam/src/cus_conn_mgr.cpp#L46 (the line number may differ between release versions). You will need to rebuild after the code change.
std::string if_names[] = {"aerial00", "aerial01"}; // First one is the default port, the second one is the backup port
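The YAML settings described above can be sanity-checked before starting the controller. The following is a minimal sketch, not part of the Aerial SDK: `check_failover_config` is a hypothetical helper, and it assumes the YAML file has already been parsed into a Python dictionary by a loader of your choice. Only the keys named in this procedure (`nics`, `cells[].nic`, `cus_port_failover`) are inspected.

```python
def check_failover_config(cfg, default_pcie, backup_pcie):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: `cfg` is the already-parsed cuphycontroller.yaml content.
    """
    problems = []
    driver = cfg.get("cuphydriver_config", {})

    # Both NIC ports must be listed under cuphydriver_config.nics.
    listed = [entry.get("nic") for entry in driver.get("nics", [])]
    for pcie in (default_pcie, backup_pcie):
        if pcie not in listed:
            problems.append(f"NIC {pcie} is missing from 'nics'")

    # Every cell should start on the default port.
    for idx, cell in enumerate(driver.get("cells", [])):
        if cell.get("nic") != default_pcie:
            problems.append(
                f"cell {idx} uses nic {cell.get('nic')!r}, expected {default_pcie!r}"
            )

    # The failover feature flag must be enabled.
    if driver.get("cus_port_failover") != 1:
        problems.append("'cus_port_failover' is not set to 1")
    return problems


# Hand-written stand-in for a parsed configuration (illustrative values only).
example = {
    "cuphydriver_config": {
        "cus_port_failover": 1,
        "nics": [{"nic": "0000:01:00.0"}, {"nic": "0000:01:00.1"}],
        "cells": [{"nic": "0000:01:00.0"}, {"nic": "0000:01:00.0"}],
    }
}
print(check_failover_config(example, "0000:01:00.0", "0000:01:00.1"))
```

The helper returns messages instead of raising, so all misconfigurations are reported in one pass.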
Start the cuphycontroller.
The following is example output of the connection manager from the cuphycontroller console on startup:
11:05:51.677624 WRN 56926 0 [OAM.CUSConnMgr] AerialCUSConnMgr started...
11:05:51.677624 WRN 56926 0 [OAM.CUSConnMgr] Default port: 'aerial00', backup port: 'aerial01'
11:05:51.677688 WRN 56926 0 [OAM.CUSConnMgr] Interface 'aerial00' index is: 2
11:05:51.677693 WRN 56926 0 [OAM.CUSConnMgr] Interface 'aerial01' index is: 5
11:05:51.677693 WRN 56926 0 [OAM.CUSConnMgr] Default CUS port index is: 2
11:05:51.678065 WRN 56926 0 [OAM.CUSConnMgr] Interface idx 2 PCIe address is: 0000:01:00.0
11:05:51.678170 WRN 56926 0 [OAM.CUSConnMgr] Interface idx 5 PCIe address is: 0000:01:00.1
11:05:51.678180 WRN 56926 0 [OAM.CUSConnMgr] Listening for link down events...
11:05:51.678272 WRN 56926 0 [OAM.CUSConnMgr] Link up event detected on interface index: 2
11:05:51.678280 WRN 56926 0 [OAM.CUSConnMgr] Link up event detected on interface index: 5
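To confirm that the connection manager resolved each interface name to the expected PCIe address, the startup output can be cross-checked with a short script. This is a sketch under the assumption that the console lines keep the wording shown in the example output; `map_interfaces` and the regular expressions are illustrative names, not part of the SDK.

```python
import re

# Patterns matched against the message text of the example startup output.
NAME_RE = re.compile(r"Interface '(?P<name>[^']+)' index is: (?P<idx>\d+)")
PCIE_RE = re.compile(r"Interface idx (?P<idx>\d+) PCIe address is: (?P<pcie>\S+)")


def map_interfaces(log_lines):
    """Join the 'index is' and 'PCIe address is' lines on the interface index."""
    names, pcie = {}, {}
    for line in log_lines:
        match = NAME_RE.search(line)
        if match:
            names[int(match["idx"])] = match["name"]
            continue
        match = PCIE_RE.search(line)
        if match:
            pcie[int(match["idx"])] = match["pcie"]
    # Interfaces without a reported PCIe address map to None.
    return {name: pcie.get(idx) for idx, name in names.items()}


startup_log = [
    "11:05:51.677688 WRN 56926 0 [OAM.CUSConnMgr] Interface 'aerial00' index is: 2",
    "11:05:51.677693 WRN 56926 0 [OAM.CUSConnMgr] Interface 'aerial01' index is: 5",
    "11:05:51.678065 WRN 56926 0 [OAM.CUSConnMgr] Interface idx 2 PCIe address is: 0000:01:00.0",
    "11:05:51.678170 WRN 56926 0 [OAM.CUSConnMgr] Interface idx 5 PCIe address is: 0000:01:00.1",
]
print(map_interfaces(startup_log))
```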
The Aerial Connection Manager will monitor for port failure (i.e. link down events).
Execute the FH port failover script on the FH switch:
MAC address remapping
Apply port active rules
When port failure happens, Aerial Connection Manager will do the following:
Change the FH port to the redundant/standby port.
Reconfigure ptp4l to the redundant/standby port and restart the ptp4l daemon.
Check if the CUS-plane switch completes and determine the impact on the C/U-plane (i.e. whether FH packets are dropped/early/on-time/late).
The following is example output from the cuphycontroller console when a link down event is detected with the ‘aerial00’ default port:
11:11:03.068607 WRN 56926 0 [OAM.CUSConnMgr] Link down event detected on interface index: 2
11:11:03.195914 WRN 56926 0 [OAM.CUSConnMgr] ptp4l service restarted successfully.
11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] Successfully switch CUS port to interface aerial01
11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CU Plane port failover took 109857 nanoseconds.
11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CUS Plane port failover took 127309590 nanoseconds.
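The reported durations are in nanoseconds, which makes them hard to compare at a glance. The sketch below extracts them and converts them to milliseconds. It assumes the "took N nanoseconds" wording from the example output; `failover_durations_ms` is a hypothetical helper name.

```python
import re

# Captures the label before "port failover took" and the nanosecond count.
TOOK_RE = re.compile(r"\] (?P<what>.+?) port failover took (?P<ns>\d+) nanoseconds")


def failover_durations_ms(log_lines):
    """Return {label: duration in milliseconds} for each 'took N nanoseconds' line."""
    durations = {}
    for line in log_lines:
        match = TOOK_RE.search(line)
        if match:
            durations[match["what"]] = int(match["ns"]) / 1e6
    return durations


failover_log = [
    "11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CU Plane port failover took 109857 nanoseconds.",
    "11:11:03.195916 WRN 56926 0 [OAM.CUSConnMgr] CUS Plane port failover took 127309590 nanoseconds.",
]
for label, ms in failover_durations_ms(failover_log).items():
    print(f"{label}: {ms:.3f} ms")
```

In the example output, the C/U-plane switch alone takes about 0.11 ms, while the full CUS-plane failover, which includes the ptp4l restart, takes about 127 ms.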
FH Switch Test Script
The following test script is verified on the Spectrum SN3750 switch:
The default active NIC port is connected to the swp7 port on the switch.
The standby NIC port is connected to the swp8 port on the switch.
Use the test script as follows:
sudo ./failover -port1 swp7 -port2 swp8 -iter 1 -interval 0.1 -wait 20
#!/usr/bin/python3
import sys
import os
import json
import subprocess
import argparse
import time
import datetime


def setup_arg_parser():
    cfg = argparse.ArgumentParser()
    cfg.add_argument("-port1", help="port1 (primary FH port)", required=True)
    cfg.add_argument("-port2", help="port2 (Failover FH port)", required=True)
    cfg.add_argument("-iter", help="number of ping pong iterations", required=True)
    cfg.add_argument("-interval", help="polling interval in seconds, e.g. 0.1", required=True)
    cfg.add_argument("-wait", help="wait between ping pong iterations", required=True)
    return cfg


def get_port_state(port):
    cmd = 'nv show interface {} link state --output json'.format(port)
    out = subprocess.check_output(cmd.split())
    j1 = json.loads(out)
    port_state = list(j1.keys())[0]
    return port_state


def get_port_ptp_counter(port):
    #cmd = 'nv show interface {} counters ptp --output json'.format(port)
    cmd = 'ptpctl -j show interface ethernet {} counters'.format(port)
    out = subprocess.check_output(cmd.split())
    j1 = json.loads(out)
    delay_resp_tx_cnt = j1['delay-resp']['transmitted']
    return delay_resp_tx_cnt


def prepare_ports(port1, port2):
    # get port1 and port2 link state
    port1_state = get_port_state(port1)
    port2_state = get_port_state(port2)
    # if both ports are down, bring both up, and pick port1 to be brought down
    # if both ports are up, pick port1 to be brought down
    if port1_state == port2_state:
        if port1_state == 'down':
            os.system('ip link set up {}'.format(port1))
            os.system('ip link set up {}'.format(port2))
            time.sleep(3)
        port_to_fail = port1
        failover_port = port2
    else:
        # if port1 is up and port2 is down, bring up port2 and pick port1 to be down
        # if port1 is down and port2 is up, bring up port1 and pick port2 to be down
        if port1_state == 'up':
            os.system('ip link set up {}'.format(port2))
            time.sleep(3)
            port_to_fail = port1
            failover_port = port2
        else:
            os.system('ip link set up {}'.format(port1))
            time.sleep(3)
            port_to_fail = port2
            failover_port = port1
    print('Preparing ports: failover from {} to {}'.format(port_to_fail, failover_port))
    return port_to_fail, failover_port


def start_sequence(port_to_fail, failover_port, interval):
    # ensure both ports are up
    p1_state = get_port_state(port_to_fail)
    p2_state = get_port_state(failover_port)
    if p1_state == 'down' or p2_state == 'down':
        print('Not both ports are up, abort.')
        sys.exit(1)
    # take failover_port ptp counter snapshot
    delay_resp_tx_cnt = get_port_ptp_counter(failover_port)
    new_cnt = delay_resp_tx_cnt
    # bring down port_to_fail and log in syslog
    os.system('ip link set down {}'.format(port_to_fail))
    os.system('logger \"{} is shutdown\"'.format(port_to_fail))
    # update mac address translation: hack
    acl_path = '/etc/cumulus/acl/policy.d/50_nvue.rules'
    if port_to_fail == 'swp7':
        os.system('cp port2_active.rules {}'.format(acl_path))
    else:
        os.system('cp port1_active.rules {}'.format(acl_path))
    os.system('cl-acltool -i')
    os.system('logger \"Updated mac translation\"')
    # start polling for failover port counter change
    print(datetime.datetime.now())
    while new_cnt == delay_resp_tx_cnt:
        time.sleep(float(interval))
        new_cnt = get_port_ptp_counter(failover_port)
    print(datetime.datetime.now())
    # log counter change to syslog
    os.system('logger \"{} ptp is locked\"'.format(failover_port))


if __name__ == '__main__':
    parser = setup_arg_parser()
    cfg = parser.parse_args(sys.argv[1:])
    port1 = cfg.port1
    port2 = cfg.port2
    for i in range(int(cfg.iter)):
        port_to_fail, failover_port = prepare_ports(port1, port2)
        msg = 'Iteration {} start, failover from {} to {}'.format(i, port_to_fail, failover_port)
        print(msg)
        os.system('logger \"{}\"'.format(msg))
        start_sequence(port_to_fail, failover_port, cfg.interval)
        print('Iteration {} finished, waiting to start next sequence'.format(i))
        time.sleep(int(cfg.wait))
        port1 = port_to_fail
        port2 = failover_port
    os.system('ip link set up {}'.format(port1))
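The polling loop in `start_sequence` waits indefinitely for the PTP delay-response counter on the failover port to change, so the script hangs if PTP never recovers. One possible variation, shown here as a sketch and not as part of the verified script, is to bound the wait. `wait_for_counter_change` is a hypothetical helper; the counter reader is passed in as a callable so the logic can be tried without a switch.

```python
import time


def wait_for_counter_change(read_counter, baseline, interval, timeout,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll read_counter() until it differs from baseline or timeout expires.

    Returns the new counter value, or None if no change was seen in time.
    `sleep` and `clock` are injectable so the function can be tested offline.
    """
    deadline = clock() + timeout
    while True:
        current = read_counter()
        if current != baseline:
            return current
        if clock() >= deadline:
            return None
        sleep(interval)


# Offline demonstration with canned readings instead of a real PTP counter.
readings = iter([10, 10, 11])
print(wait_for_counter_change(lambda: next(readings), 10, interval=0.0, timeout=1.0))
```

In the script, `read_counter` could be `lambda: get_port_ptp_counter(failover_port)`, and a `None` result could be logged as a failed iteration instead of blocking the remaining ones.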
FH switch rule changes
The RU emulator NIC port is connected to the swp3 port on the switch.
94:6d:ae:f5:ab:98 is the MAC address of the default port.
94:6d:ae:f5:ab:99 is the MAC address of the standby port.
port2_active.rules
# Auto-generated by NVUE!
# Any local modifications will prevent NVUE from re-generating this file.
# md5sum: 4e39a3b53931b61cc128e47ce2ca6d2c
[iptables]
[ip6tables]
[ebtables]
## ACL ru-in in dir inbound on interface swp3 ##
# rule-id #1: #
-t nat -A PREROUTING -i swp3 --comment rule_id:1,acl_name:ru-in,dir:inbound,interface_id:swp3 -d 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j dnat --to-destination 94:6d:ae:f5:ab:99
## ACL ru-out in dir outbound on interface swp3 ##
# rule-id #1: #
-t nat -A POSTROUTING -o swp3 --comment rule_id:1,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
# rule-id #2: #
-t nat -A POSTROUTING -o swp3 --comment rule_id:2,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:99/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
port1_active.rules
# Auto-generated by NVUE!
# Any local modifications will prevent NVUE from re-generating this file.
# md5sum: 4e39a3b53931b61cc128e47ce2ca6d2c
[iptables]
[ip6tables]
[ebtables]
## ACL ru-in in dir inbound on interface swp3 ##
# rule-id #1: #
-t nat -A PREROUTING -i swp3 --comment rule_id:1,acl_name:ru-in,dir:inbound,interface_id:swp3 -d 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j dnat --to-destination 94:6d:ae:f5:ab:98
## ACL ru-out in dir outbound on interface swp3 ##
# rule-id #1: #
-t nat -A POSTROUTING -o swp3 --comment rule_id:1,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:98/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
# rule-id #2: #
-t nat -A POSTROUTING -o swp3 --comment rule_id:2,acl_name:ru-out,dir:outbound,interface_id:swp3 -s 94:6d:ae:f5:ab:99/ff:ff:ff:ff:ff:ff -j snat --to-source 94:6d:ae:f5:ab:98
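The two rule files differ only in the destination that inbound RU traffic is rewritten to: the MAC address of whichever NIC port is currently active. Outbound traffic from either port is always rewritten to the default port's MAC address, so the RU keeps seeing a single peer. The sketch below reproduces that pattern for arbitrary addresses. `render_ebtables_rules` is a hypothetical helper written for illustration; it is not an NVUE or Cumulus tool, and it only emits the three rule lines, not the file header or section markers.

```python
MASK = "ff:ff:ff:ff:ff:ff"


def render_ebtables_rules(ru_port, default_mac, standby_mac, active_mac):
    """Return the three [ebtables] rule lines for the given active NIC port MAC."""
    # Inbound: frames from the RU addressed to the default MAC go to the active port.
    rules = [
        f"-t nat -A PREROUTING -i {ru_port} "
        f"--comment rule_id:1,acl_name:ru-in,dir:inbound,interface_id:{ru_port} "
        f"-d {default_mac}/{MASK} -j dnat --to-destination {active_mac}"
    ]
    # Outbound: frames from either NIC port appear to come from the default MAC.
    for rule_id, src in enumerate((default_mac, standby_mac), start=1):
        rules.append(
            f"-t nat -A POSTROUTING -o {ru_port} "
            f"--comment rule_id:{rule_id},acl_name:ru-out,dir:outbound,interface_id:{ru_port} "
            f"-s {src}/{MASK} -j snat --to-source {default_mac}"
        )
    return rules


# Rules for the case where the standby port (…:99) is active.
for line in render_ebtables_rules(
    "swp3", "94:6d:ae:f5:ab:98", "94:6d:ae:f5:ab:99", active_mac="94:6d:ae:f5:ab:99"
):
    print(line)
```

Passing the default MAC as `active_mac` yields the rule lines of port1_active.rules; passing the standby MAC yields those of port2_active.rules.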