NVIDIA UFM Enterprise User Manual v6.15.0

PDR Deterministic Plugin

Overview

The PDR Deterministic plugin is a Docker container managed by UFM. It handles port isolation in place of the UFM automatic isolation. To decide which ports to isolate, the plugin runs an algorithm on telemetry data provided by UFM Telemetry, monitoring packet drop rate (PDR), BER counter values, and cable temperature. The plugin can also operate in a "dry run" mode, in which isolation decisions are written to the log without initiating port isolation.

  1. Install UFM with the latest software version.

  2. Run:


    /etc/init.d/ufmd start

  3. To obtain the PDR plugin image, contact the NVIDIA Support team. Then load the plugin using the following command:

    When working with UFM in HA mode, load the plugin on the standby node.


    ufmapl [ mgmt-sa ] (config) # docker load ufm-plugin-pdr-determinitic.tar

  4. Run the following command. Add -p pdr-determinitic to enable the plugin:


    /opt/ufm/scripts/manage_ufm_plugins.sh add -p pdr-determinitic

  5. Ensure that the plugin is up and running. Run: /opt/ufm/scripts/manage_ufm_plugins.sh show

The following table lists the default configuration when running the plugin. These configurations can be changed via the pdr_deterministic.conf file.

| Parameter | Default Value | Description |
|---|---|---|
| T_ISOLATE | 300 | Interval for requesting telemetry counters, in seconds. |
| MAX_NUM_ISOLATE | 10 | Maximum number of ports to be isolated: max(10, 0.5% * fabric_size). |
| TMAX | 70 | The maximal nominal operating temperature for fabric devices and cables (minimum of the two). Value is in Celsius. |
| D_TMAX | 10 | The maximum allowed temperature change within the T_ISOLATE interval. Value is in Celsius. |
| MAX_PDR | 1e-12 | The maximum allowed packet drop rate. |
| CONFIGURED_BER_CHECK | True | Indicates whether to check the BER counter thresholds. |
| DRY_RUN | False | If set to true, isolation decisions are only logged and do not take effect. |
| DEISOLATE_CONSIDER_TIME | 5 | Consideration time for port de-isolation, in minutes. |
| AUTOMATIC_DEISOLATE | True | Automatically performs de-isolation, even if a port is not set as "treated". |
| DO_DEISOLATION | True | If set to false, the plugin does not perform de-isolation. |
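The MAX_NUM_ISOLATE description gives the limit as max(10, 0.5% * fabric_size). A minimal sketch of that arithmetic follows. The function name is hypothetical, and rounding the 0.5% term down to a whole number of ports is an assumption that the manual does not state.

```python
def max_ports_to_isolate(fabric_size: int, floor: int = 10) -> int:
    """Return max(floor, 0.5% of fabric_size).

    Assumption: the 0.5% term is rounded down to a whole port count.
    Integer arithmetic (5/1000 = 0.5%) avoids floating-point rounding.
    """
    half_percent = (fabric_size * 5) // 1000
    return max(floor, half_percent)


if __name__ == "__main__":
    for size in (500, 2000, 4000, 10000):
        print(size, max_ports_to_isolate(size))
```

Under this reading, the floor of 10 applies to any fabric of up to 2000 ports, and the 0.5% term takes over beyond that.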

Warning

BER thresholds will be taken from the Field_BER_Thresholds.csv file.

The plugin's purpose is to isolate malfunctioning ports through the isolation API provided by the UFM. A port is set as "isolated" when its counter values exceed the configured thresholds for cable temperature, effective BER, symbol BER, raw BER, or packet drop rate. A port can be de-isolated once its values have been back to normal for 5 minutes (configurable).
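The threshold checks described above could be sketched as follows. This is an illustration only, not the plugin's code: the `PortSample` type, its field names, and the `isolation_reasons` function are hypothetical, the strict `>` comparisons are an assumption, and treating the temperature change as an absolute difference between consecutive samples is also an assumption. The threshold values are the defaults from the configuration table.

```python
from dataclasses import dataclass
from typing import List

# Default thresholds taken from the configuration table.
TMAX = 70.0      # Celsius
D_TMAX = 10.0    # Celsius per T_ISOLATE interval
MAX_PDR = 1e-12  # packet drop rate


@dataclass
class PortSample:
    """Hypothetical per-port snapshot; field names are illustrative."""
    temperature: float           # current cable temperature (Celsius)
    previous_temperature: float  # temperature at the previous sample
    packet_drop_rate: float
    ber_exceeded: bool           # result of a BER threshold check done elsewhere


def isolation_reasons(sample: PortSample) -> List[str]:
    """Return the list of thresholds the sample exceeds (empty if healthy)."""
    reasons = []
    if sample.temperature > TMAX:
        reasons.append("temperature")
    # Assumption: "change" means the absolute difference between samples.
    if abs(sample.temperature - sample.previous_temperature) > D_TMAX:
        reasons.append("temperature_change")
    if sample.packet_drop_rate > MAX_PDR:
        reasons.append("packet_drop_rate")
    if sample.ber_exceeded:
        reasons.append("ber")
    return reasons


if __name__ == "__main__":
    hot = PortSample(75.0, 72.0, 0.0, False)
    print(isolation_reasons(hot))
```

Returning the list of reasons, rather than a single boolean, keeps the log entry for each decision informative.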

To calculate BER counters, the plugin first determines the maximum window it must wait before a BER value can be calculated, using the following formula:

image2023-5-2_13-18-14.png

Example:

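| Rate | Rate (bit/s) | BER Target | Minimum Bits | Minimum Time (seconds) | Minimum Time (minutes) |
|---|---|---|---|---|---|
| HDR | 2.00E+11 | 1.00E-12 | 1.00E+12 | 5 | 0.083333 |
| HDR | 2.00E+11 | 1.00E-13 | 1.00E+13 | 50 | 0.833333 |
| HDR | 2.00E+11 | 1.00E-14 | 1.00E+14 | 500 | 8.333333 |
| HDR | 2.00E+11 | 1.00E-16 | 1.00E+16 | 50000 | 833.3333 |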
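The rows of the example table are consistent with two relations: minimum bits = 1 / BER target, and minimum time = minimum bits / link rate. The sketch below reproduces the table on that basis. The relations are inferred from the table values rather than taken from the formula image, and the function name is hypothetical.

```python
def min_ber_window_seconds(rate_bps: float, ber_target: float) -> float:
    """Seconds needed to transfer enough bits to observe a given BER target.

    Inferred from the example table:
      minimum bits = 1 / ber_target
      minimum time = minimum bits / rate_bps
    """
    min_bits = 1.0 / ber_target
    return min_bits / rate_bps


if __name__ == "__main__":
    hdr_rate_bps = 2.0e11  # HDR rate as listed in the table
    for target in (1e-12, 1e-13, 1e-14, 1e-16):
        seconds = min_ber_window_seconds(hdr_rate_bps, target)
        print(f"{target:.0e}: {seconds:.6g} s, {seconds / 60:.6g} min")
```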

BER counters are calculated with the following formula:

image2023-5-2_13-19-34.png

The following telemetry counters are used:

  • Symbol: phy_symbol_errors_high/low

  • Effective: phy_effective_errors_high/low

  • Raw: sum(phy_raw_errors_lane<i>_high/low)

Counter data is kept in memory for the duration of the largest window period.
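As a rough illustration of how counters split into _high/_low parts might be turned into a BER value, consider the sketch below. It rests on two assumptions that the text does not confirm: each counter is split into two 32-bit halves, and BER is the growth in the error count over a window divided by the number of bits transferred in that window (rate multiplied by window length). The actual formula is the one in the image above; all function names here are hypothetical.

```python
from typing import Iterable, Tuple


def combine_high_low(high: int, low: int) -> int:
    """Join two halves into one counter (assumed 32-bit high/low layout)."""
    return (high << 32) | low


def raw_error_count(lanes: Iterable[Tuple[int, int]]) -> int:
    """Sum the per-lane raw error counters, given (high, low) pairs per lane."""
    return sum(combine_high_low(high, low) for high, low in lanes)


def ber_over_window(errors_start: int, errors_end: int,
                    rate_bps: float, window_seconds: float) -> float:
    """Assumed BER estimate: error growth divided by bits sent in the window."""
    bits_transferred = rate_bps * window_seconds
    return (errors_end - errors_start) / bits_transferred


if __name__ == "__main__":
    start = combine_high_low(0, 100)
    end = combine_high_low(0, 110)
    print(ber_over_window(start, end, rate_bps=2.0e11, window_seconds=5.0))
```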

The plugin can simulate port isolation without actually executing it, so that the algorithm's performance and decision-making can be analyzed and adjusted in the future. This is done through a "dry_run" flag: when it is set, the plugin only records its port isolation decisions in the plugin's log and does not invoke the port isolation API.
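The dry-run pattern can be sketched as below. This is not the plugin's implementation: `apply_isolation` and the `isolate` callback are hypothetical stand-ins, with the callback taking the place of the call to the UFM isolation API.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("pdr_dry_run_sketch")


def apply_isolation(ports: Iterable[str],
                    isolate: Callable[[str], None],
                    dry_run: bool) -> List[str]:
    """Log every isolation decision; call `isolate` only when dry_run is False.

    `isolate` is a hypothetical callback standing in for the isolation API.
    Returns the ports that were actually isolated.
    """
    isolated = []
    for port in ports:
        logger.info("isolation decision for port %s (dry_run=%s)", port, dry_run)
        if not dry_run:
            isolate(port)
            isolated.append(port)
    return isolated


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    calls: List[str] = []
    apply_isolation(["port-1", "port-2"], calls.append, dry_run=True)
    print(calls)
```

Because the logging happens in both modes, a dry run produces the same decision log as a live run, which is what makes it useful for evaluating the algorithm.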

© Copyright 2023, NVIDIA. Last updated on Nov 8, 2023.