PDR Deterministic Plugin
The PDR deterministic plugin, managed by UFM, is a Docker container that isolates malfunctioning ports and then restores repaired links to their previous state by lifting the isolation. The plugin decides which ports to isolate using an algorithm based on telemetry data from UFM Telemetry: packet drop rate, BER counter values, link down counter, and port temperature. Every decision made by the plugin triggers an event in UFM for tracking purposes.
The PDR plugin performs the following tasks:
Collects telemetry data using UFM Dynamic Telemetry
Identifies potential failures based on telemetry calculations and isolates them to avert any interruption to traffic flow
Maintains a record of maintenance procedures that can be executed to restore an isolated link
Verifies, after the required maintenance has been performed, whether the ports can be de-isolated and brought back online
The plugin can simulate port isolation without executing it, so that the algorithm's performance and decision-making can be analyzed and adjusted later. This is controlled by the "dry_run" flag: when it is set, the plugin only records its port isolation decisions in the plugin's log instead of invoking the port isolation API.
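The dry-run behavior described above can be sketched as follows. This is a minimal illustration, not the plugin's code: `isolate_port`, the logger name, and the `isolate_api` callable (a stand-in for the real port isolation API) are all hypothetical.

```python
import logging

logger = logging.getLogger("pdr_plugin_sketch")  # hypothetical logger name


def isolate_port(port_name, dry_run, isolate_api):
    """Log an isolation decision; call the API only when dry_run is False.

    `isolate_api` is a hypothetical callable standing in for the real
    port isolation API. Returns True if the API was actually invoked.
    """
    # The decision is logged in both modes.
    logger.warning("Isolation decision for port %s (dry_run=%s)", port_name, dry_run)
    if dry_run:
        return False
    isolate_api(port_name)
    return True


# Usage: record API calls in a list instead of touching a real fabric.
calls = []
invoked_dry = isolate_port("switch1/7", dry_run=True, isolate_api=calls.append)
invoked_real = isolate_port("switch1/7", dry_run=False, isolate_api=calls.append)
```

In this sketch only the second call reaches the stand-in API, while both calls leave a log entry.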
Install UFM with the latest software version.
Run:
ufmapl [ mgmt-sa ] (config) # ufm start
To get the PDR plugin image, contact the NVIDIA Support team. When working with UFM in HA mode, load the plugin on the standby node.
Load the plugin using this command:
docker load -i ufm-plugin-pdr-determinitic.tar
Run the following command. Add -p pdr-determinitic to enable the plugin:
ufmapl [ mgmt-sa ] (config) # ufm plugin pdr-determinitic add
Ensure that the plugin is up and running. Run:
ufmapl [ mgmt-sa ] (config) # show ufm plugin
NDR Link Validation Procedure
Verify only ports that are in the INIT, ARMED or ACTIVE state. Track the SymbolErrorsExt of every such link for at least 120m. If the polling period is Pm, keep N = (125 + Pm + 1)/Pm samples. Two delta windows are also computed: the number of samples covering 12 minutes, S12m = (12 + Pm + 1)/Pm, and the number of samples covering 125 minutes, S125m = (125 + Pm + 1)/Pm. The thresholds are:
12m_thd = LinkBW_Gbps*1e9*12*60*1e-14 (2.88 for NDR)
125m_thd = LinkBW_Gbps*1e9*125*60*1e-15 (3 for NDR)
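The sample counts and thresholds above can be worked through in a short sketch. It assumes that Pm is expressed in minutes (it is added to minute values in the formulas), that the divisions are integer divisions, and that an NDR link runs at 400 Gb/s. The function names are hypothetical.

```python
def sample_counts(pm_minutes):
    """Return (N, S12m, S125m) for a polling period given in minutes.

    Integer division is an assumption; the text does not say how
    fractional sample counts are handled.
    """
    n = (125 + pm_minutes + 1) // pm_minutes
    s12m = (12 + pm_minutes + 1) // pm_minutes
    s125m = (125 + pm_minutes + 1) // pm_minutes
    return n, s12m, s125m


def thresholds(link_bw_gbps):
    """Return (12m_thd, 125m_thd) in symbol errors for a link bandwidth in Gb/s."""
    bits_per_second = link_bw_gbps * 1e9
    thd_12m = bits_per_second * 12 * 60 * 1e-14    # 12 minutes at BER 1e-14
    thd_125m = bits_per_second * 125 * 60 * 1e-15  # 125 minutes at BER 1e-15
    return thd_12m, thd_125m


# Usage: a 5-minute polling period on an NDR (400 Gb/s) link.
n, s12m, s125m = sample_counts(5)
thd_12m, thd_125m = thresholds(400)
```

With these inputs the thresholds come out at about 2.88 and 3, matching the NDR values quoted above.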
Check the following conditions for every port in the given set:
If Delta(LinkDownedCounterExt) is > 0 on both the port and its remote port, add the link to the list of bad_ports. This condition is ignored if the --no_down_count flag is provided.
If symbol_errors[now_idx] - symbol_errors[now_idx - S12m] is > 12m_thd, add the link to the list of bad_ports, and continue with the next link.
If symbol_errors[now_idx] - symbol_errors[now_idx - S125m] is > 125m_thd, add the link to the list of bad_ports, and continue with the next link.
Packet drop rate criteria
When packet drops caused by link health are detected, the problematic link should be isolated. To achieve this, a target packet_drop/packet_delivered ratio can be used: TX ports whose receiver exceeds this threshold are added to the list of bad_ports. The drawback of this method is that such links may fluctuate between bad and good states, since their BER may be normal. It is therefore advisable to track their statistics over time and to refrain from reintegrating them after their second or third de-isolation.
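The per-link conditions and the packet drop criterion described above can be sketched together as follows. The function names and argument layout are hypothetical. Skipping a window check until enough samples exist is an assumption, not documented behavior.

```python
def link_is_bad(symbol_errors, link_down_delta, remote_link_down_delta,
                s12m, s125m, thd_12m, thd_125m, no_down_count=False):
    """Apply the three validation conditions to one link.

    `symbol_errors` is the history of SymbolErrorsExt samples, oldest first.
    """
    now_idx = len(symbol_errors) - 1
    # Condition 1: link-down counters grew on both ends (unless disabled).
    if not no_down_count and link_down_delta > 0 and remote_link_down_delta > 0:
        return True
    # Condition 2: symbol error growth over the 12-minute window.
    if now_idx >= s12m and symbol_errors[now_idx] - symbol_errors[now_idx - s12m] > thd_12m:
        return True
    # Condition 3: symbol error growth over the 125-minute window.
    if now_idx >= s125m and symbol_errors[now_idx] - symbol_errors[now_idx - s125m] > thd_125m:
        return True
    return False


def exceeds_drop_ratio(packets_dropped, packets_delivered, max_pdr):
    """Return True when packet_drop/packet_delivered exceeds the target ratio."""
    if packets_delivered == 0:
        return False  # assumption: no traffic means no verdict
    return packets_dropped / packets_delivered > max_pdr


# Usage with NDR thresholds and a 5-minute polling period (S12m=3, S125m=26).
clean = [0] * 27
burst = [0] * 23 + [0, 1, 2, 4]        # 4 new errors within the last 3 samples
slow = [i // 6 for i in range(27)]     # 4 errors spread over the full window
burst_bad = link_is_bad(burst, 0, 0, 3, 26, 2.88, 3.0)
slow_bad = link_is_bad(slow, 0, 0, 3, 26, 2.88, 3.0)
clean_bad = link_is_bad(clean, 0, 0, 3, 26, 2.88, 3.0)
```

The `slow` history passes the 12-minute check but fails the 125-minute one, which is the case the second window is meant to catch.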
Return to Service
The plugin continuously monitors the ports in bad_ports, assesses their Bit Error Rate (BER), and reintegrates a port once it passes the 126m test without errors.
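One way to read the return-to-service test is sketched below. It assumes that "without errors" means no growth of the symbol error counter across the test window, and that the window is expressed as a number of samples. Both are assumptions, and the function name is hypothetical.

```python
def can_return_to_service(symbol_errors, window_samples):
    """Return True when no new symbol errors occurred over the test window.

    `symbol_errors` is the sample history collected while the port was
    isolated, oldest first. A port with too little history is kept isolated.
    """
    if len(symbol_errors) <= window_samples:
        return False
    now_idx = len(symbol_errors) - 1
    return symbol_errors[now_idx] - symbol_errors[now_idx - window_samples] == 0


# Usage: 26 samples span the window at a 5-minute polling period.
ready = can_return_to_service([7] * 27, 26)             # counter stayed flat
not_ready = can_return_to_service([7] * 26 + [8], 26)   # one new error
too_short = can_return_to_service([7] * 10, 26)         # not enough history
```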
Configuration
The following parameters are configurable via the plugin's configuration file (pdr_deterministic.conf).
| Name | Description | Default Value |
|---|---|---|
| T_ISOLATE | Interval for requesting telemetry counters, in seconds. | 300 |
| MAX_NUM_ISOLATE | Maximum ports to be isolated. max(MAX_NUM_ISOLATE, 0.5% * fabric_size) | 10 |
| TMAX | Maximum temperature threshold | 70 (Celsius) |
| D_TMAX | Maximum allowed Temperature Delta | 10 |
| MAX_PDR | Maximum allowed packet drop rate | 1e-12 |
| CONFIGURED_BER_CHECK | If set to true, the plugin will isolate based on BER calculations | True |
| CONFIGURED_TEMP_CHECK | If set to true, the plugin will isolate based on temperature measurements | True |
| NO_DOWN_COUNT | If set to true, the plugin will isolate based on LinkDownedCounterExt measurements | True |
| ACCESS_ISOLATION | If set to true, the plugin will isolate ports connected via access link | True |
| DRY_RUN | Isolation decisions will be only logged and will not take effect | False |
| DEISOLATE_CONSIDER_TIME | Consideration time for port de-isolation (in minutes) | 5 |
| DO_DEISOLATION | If set to false, the plugin will not perform de-isolation | True |
| DYNAMIC_WAIT_TIME | Seconds to wait for the dynamic telemetry session to respond | 30 |
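The effective isolation cap given for MAX_NUM_ISOLATE, max(MAX_NUM_ISOLATE, 0.5% * fabric_size), can be illustrated with a short sketch. The function name is hypothetical, and `fabric_size` is assumed to be the number of ports in the fabric. How the plugin rounds a fractional result is not stated, so the raw value is returned.

```python
def max_isolated_ports(max_num_isolate, fabric_size):
    """Return max(MAX_NUM_ISOLATE, 0.5% * fabric_size) without rounding."""
    return max(max_num_isolate, 0.005 * fabric_size)


# Usage with the default MAX_NUM_ISOLATE of 10.
small_fabric_cap = max_isolated_ports(10, 1000)    # 0.5% is 5, so the cap stays 10
large_fabric_cap = max_isolated_ports(10, 10000)   # 0.5% is 50, which exceeds 10
```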
Calculating BER Counters
For calculating BER counters, the plugin extracts the maximum window it needs to wait for calculating the BER value, using the following formula:
Example:
| Rate | Bit Rate (bit/s) | BER Target | Minimum Bits | Minimum Time in Seconds | In Minutes |
|---|---|---|---|---|---|
| HDR | 2.00E+11 | 1.00E-12 | 1.00E+12 | 5 | 0.083333 |
| HDR | 2.00E+11 | 1.00E-13 | 1.00E+13 | 50 | 0.833333 |
| HDR | 2.00E+11 | 1.00E-14 | 1.00E+14 | 500 | 8.333333 |
| HDR | 2.00E+11 | 1.00E-16 | 1.00E+16 | 50000 | 833.3333 |
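The example rows above are consistent with minimum bits = 1 / BER target and minimum time = minimum bits / bit rate. The sketch below reproduces them under that reading. The relationship is inferred from the table values rather than taken from the plugin's stated formula, and the function name is hypothetical.

```python
def min_window_seconds(rate_bits_per_second, ber_target):
    """Return the minimum observation time, in seconds, for a BER target."""
    min_bits = 1.0 / ber_target              # bits needed to observe one error
    return min_bits / rate_bits_per_second   # time to transfer that many bits


# Usage: HDR (2.00E+11 bit/s) at two of the BER targets from the table.
hdr_1e12_seconds = min_window_seconds(2e11, 1e-12)
hdr_1e14_minutes = min_window_seconds(2e11, 1e-14) / 60
```

Under this reading, the 1.00E-12 target needs about 5 seconds and the 1.00E-14 target about 8.33 minutes, in line with the first and third rows.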
BER counters are calculated with the following formula: