Operations Procedures

This document describes guidelines and working methods for NDR clusters and later generations.

Confirm the operational status of the UFM service.

  • If you are using UFM Enterprise Appliance, execute the following command via the command-line interface (CLI) after logging into the UFM appliance.

    show ufm status

    For more information, refer to UFM General Commands.

  • If you are using your own server, refer to Showing UFM Processes Status.

  • If you prefer using the web user interface:

    • Navigate to the "System Health" tab in the left menu.

    • Under the "UFM Health" section, click on "Create New Report."

    • Confirm that all fields are displaying green indicators.
      For detailed instructions, refer to the UFM Health Tab.

It is also recommended to conduct a remote test of the REST API by querying the "UFM Health" report. For instructions, refer to Reports REST API.
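As a minimal sketch of such a remote check, a REST query can be issued with curl using HTTP basic authentication against the UFM server. The /ufmRest prefix matches the REST URLs used later in this document, but the report path below, as well as <ufm_ip>, <ufm_user>, and <ufm_password>, are placeholders; take the exact endpoint from the Reports REST API documentation.

    # Query the UFM reports endpoint from a remote node (placeholder path and credentials).
    # -k accepts a self-signed certificate; -u supplies the UFM credentials.
    curl -k -u <ufm_user>:<ufm_password> "https://<ufm_ip>/ufmRest/reports"

A successful response with a JSON body confirms that the REST interface is reachable and the credentials are valid.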

To generate a fabric health report and verify that all sections are green, perform the following steps using the Web UI:

  • Access the "System Health" tab on the left menu.

    • Click on "Run New Report" under the "Fabric Health" section.

    • Confirm that all fields indicate green status.

    • For detailed instructions, refer to the Fabric Health Tab.

  • Additionally, within the "System Health" tab, run the available tests under "Fabric Validation".

    • Verify that the outcomes are either "Pass" or "Completed with No Errors".

    • For detailed instructions, refer to the Fabric Validation Tab.

  • Furthermore, it is recommended to conduct REST API tests from a remote node. This can be done using the REST APIs described in the following links (a sketch follows this list):

  • Reports REST API

  • Fabric Validation Tests REST API
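
For example, one of the fabric validation tests can be triggered remotely with curl. This is a minimal sketch only: it uses the Check Temperature endpoint quoted later in this document (POST /ufmRest/fabricValidation/tests/CheckTemperature); <ufm_ip>, <ufm_user>, and <ufm_password> are placeholders, and the exact request and response formats should be taken from the Fabric Validation Tests REST API documentation.

    # Trigger the Check Temperature fabric validation test from a remote node.
    curl -k -u <ufm_user>:<ufm_password> -X POST "https://<ufm_ip>/ufmRest/fabricValidation/tests/CheckTemperature"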

Once the InfiniBand cluster is built, it is essential to create a Master Topology. This Master Topology serves as a reference during cluster operation, enabling the detection of any network configuration changes. It is noteworthy that the actual cluster topology may be different from the initially planned specifications. Detecting and validating these discrepancies in topology is crucial to ensure the cluster's proper functionality.

As an example, even in cases where a known TOR switch is defective due to hardware malfunction and is scheduled for the RMA process, the cluster can still operate, albeit with some degradation in performance and anticipated capacity.

For more comprehensive details, refer to Topology Compare REST API.

To collect InfiniBand port, PHY, and cable telemetry metrics, perform the following.

Access the embedded UFM Telemetry instance through an HTTP endpoint by entering the following URL in your browser address bar:

http://$ufm_ip$:9001/labels/enterprise

Remember to replace $ufm_ip$ with your actual UFM IP address.
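
The same endpoint can also be queried from the command line. The following is a sketch only; it fetches the Prometheus-style output and filters it for one of the counters listed under Expected Results (replace <ufm_ip> with your UFM IP address):

    # Fetch telemetry from the embedded UFM Telemetry endpoint and show a few
    # PortXmitDataExtended samples (one line per port).
    curl -s "http://<ufm_ip>:9001/labels/enterprise" | grep PortXmitDataExtended | head -5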

Expected Results:

  • PortXmitDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390

  • PortRcvDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390

  • PortXmitPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390

  • PortRcvPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390

  • SymbolErrorCounterExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 0 1648987628390

For a more compact CSV data format, access the following endpoint:
http://$ufm_ip$:9001/labels/csv/metrics
Remember to replace $ufm_ip$ with your actual UFM IP address.
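
For example, the CSV output can be saved from a remote node with curl (replace <ufm_ip> with your UFM IP address; the output file name is arbitrary):

    # Save the compact CSV telemetry snapshot to a local file.
    curl -s -o ufm_metrics.csv "http://<ufm_ip>:9001/labels/csv/metrics"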

The following table lists the link monitoring key indicators and provides their descriptions, pass/fail criteria and monitoring intervals.

| Parameter | Description | Evaluation Criteria | Monitoring Interval |
| --- | --- | --- | --- |
| Link State | | | |
| Phy_state | Physical link state | Verify link up | Ongoing |
| Logical_state | Logical link state | Verify link is in ACTIVE mode | Ongoing |
| speed_active | Active link speed | Verify expected speed | Ongoing |
| width_active | Active link width | Verify expected width 4x (or 2x for a split cable) | Ongoing |
| Link Quality | | | |
| NDR Link Quality | Link quality criteria depend on the error correction scheme type. | See the NDR Link Quality thresholds table below. Note: minimum port up time for BER measurement is 125 minutes. | Ongoing |
| PHY Errors | | | |
| Symbol_Errors | Errors after FEC and PLR | Defined by Symbol BER | Ongoing |
| Link_Down counter | Total number of link-down events that occurred as a result of involuntary link shutdown. | If the delta from the last sample is > 0: trace the event and record the switch, port, date and time, and link-down counter. If the same switch and port has at least 2 link-down occurrences within 24 hours, further investigation is required. Note: make sure the link down was due to an involuntary port down from the partner side (e.g., not due to a partner server reboot); this criterion is intended to catch major link-down events. | Ongoing |
| LinkErrorRecoveryCounter | The number of times the Port Training state machine has successfully completed the link error recovery process. | Clean, no errors | Ongoing |
| Chip temperature | Temperature in °C | If the temperature reaches the maximum threshold, the firmware performs a protective thermal shutdown. | Ongoing |
| Device FW version | Switch/HCA FW version | Verify that the approved version is the latest version released by NVIDIA; the cluster should run similar versions. | Days |
| Cables Information | | | |
| PN | Part number | No check required | Days |
| SN | Serial number | No check required | Days |
| FW ver | FW version | Verify that the approved version is the latest version released by NVIDIA | Days |
| Module temperature | Optic module only | There is an alarm and a threshold for each transceiver; usually Warning [70°C, 0°C] and Alarm [80°C, -10°C]. | Ongoing |
| Rx power / Tx power per lane | Optic module only | There is an alarm and a threshold for each transceiver. | Minutes |
| Packet Discard | | | |
| PortRcvErrors | Total number of packets containing an error that were received on the port. | < 10 per second (perform 2 successive samples) | Minutes |
| PortXmitDiscards | Total number of outbound packets discarded by the port because the port is down or congested. | < 10 per second (perform 2 successive samples) | Minutes |

NDR Link Quality thresholds (BER):

| Error Correction Scheme Type | Media Type | Post-FEC Normal | Post-FEC Warning | Post-FEC Error | Symbol Normal | Symbol Warning | Symbol Error |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Low_Latency_RS_FEC_PLR (default for DAC/ACC/Active < 100 m) | DAC/LACC/Active | 1.00E-12 | 5.00E-12 | 1.00E-11 | 1.00E-15 | 5.00E-15 | 1.00E-14 |
| KP4_Standard_RS_FEC (default for DAC/ACC/Active > 100 m) | Active | 1.00E-15 | 5.00E-15 | 1.00E-14 | 1.00E-15 | 5.00E-15 | 1.00E-14 |
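
As a hedged sketch of how the link-down criterion above can be checked against the telemetry endpoint described earlier, the commands below take two samples, strip the trailing timestamps, and report only the ports whose counter value changed. The counter name LinkDownedCounterExtended and the 5-minute sampling interval are assumptions; verify the exact label name on your UFM Telemetry instance and choose an interval matching your monitoring policy.

    # Sample the link-down counter twice and report ports whose value changed.
    curl -s "http://<ufm_ip>:9001/labels/enterprise" | grep LinkDownedCounterExtended | sed 's/ [0-9]*$//' > sample1.txt
    sleep 300    # assumed monitoring interval of 5 minutes
    curl -s "http://<ufm_ip>:9001/labels/enterprise" | grep LinkDownedCounterExtended | sed 's/ [0-9]*$//' > sample2.txt
    # Any differing line indicates a link-down event; trace the switch, port, date/time, and counter value.
    diff sample1.txt sample2.txt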

The tool used for validating cluster performance is known as ClusterKit, an integral component of the HPC-X Software Toolkit.
NVIDIA® HPC-X® is a comprehensive software bundle encompassing MPI and SHMEM communication libraries. Within this package, various acceleration components are included, enhancing the performance and scalability of applications that operate on top of these libraries. Notably, UCX (Unified Communication X) accelerates the underlying send/receive (or put/get) messages. Also included is HCOLL, which accelerates the underlying collective operations used by the MPI/PGAS languages.
For detailed documentation, along with instructions for downloading and installing HPC-X, refer to HPC-X Documentation.

HPC-X Functionality Verification

To ensure the correct operation of HPC-X, a straightforward MPI test program bundled with HPC-X can be employed. Use the following procedure:

  1. Set the HPCX_HOME environment variable to point to the HPC-X installation directory:

    % export HPCX_HOME=<HPCX Directory>

  2. Initialize HPC-X environment variables:

    % source $HPCX_HOME/hpcx-init.sh
    % hpcx_load

  3. Execute the precompiled MPI test program hello_c. The MPI program can be executed using either of the following methods:

    1. Inside a SLURM allocation or job, run:

      % mpirun $HPCX_MPI_TESTS_DIR/examples/hello_c

    2. Without SLURM, using SSH and explicitly setting the hosts to run on:

      % mpirun --host <host1,host2,…,hostN> $HPCX_MPI_TESTS_DIR/examples/hello_c

      Alternatively, you can put all hostnames into a single file (hostfile) and pass that file to mpirun (see mpirun(1) man page for details):

      % mpirun --hostfile <hostfile> $HPCX_MPI_TESTS_DIR/examples/hello_c

  4. The output should contain one line for every MPI process that was executed. Each line indicates the MPI rank of the process, the total number of processes, and the version of OpenMPI bundled with HPC-X. For instance:

    Hello, world, I am 90 of 168, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-16-g5980bac633, Unreleased developer copy, 150)

    The number of lines should match the number of cores in the allocation.

  5. Check that the ClusterKit script (clusterkit.sh) is available. Run:

    ls -l $HPCX_CLUSTERKIT_DIR/bin/run_clusterkit.sh

  6. Check that the file $HPCX_CLUSTERKIT_DIR/bin/run_clusterkit.sh exists and is executable.

Running ClusterKit

Prior to executing ClusterKit, it is important to have HPC-X properly set up with initialized environment variables. Additionally, ensure that the ClusterKit script (clusterkit.sh) is accessible, as instructed in the preceding section.
ClusterKit can be run inside a SLURM allocation or job, or without SLURM. When operating within a SLURM allocation, employ the following command:

$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -d mlx5_4:1 -x "-d bw"

Here, -d adapter:port selects which InfiniBand adapter and port to use, and -x "-d bw" selects which test to run (the bandwidth test).
If running outside SLURM allocation, use:

$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -f hostfile -d mlx5_4:1 -x "-d bw"

Where -f hostfile sets hostfile to use. The hostfile contains the list of nodes to use (see mpirun(1) man page for details).
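
A hostfile is a plain-text file listing one node per line, optionally with a slots count. The node names below are hypothetical examples:

    # Example hostfile (hypothetical node names); see mpirun(1) for the full syntax.
    node01 slots=8
    node02 slots=8
    node03 slots=8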

You can add the -D <output dir> switch to set the output directory for the run. Without it, the output is saved into a directory named after the date and time of the run (e.g., 20230731_154932).

In the output directory, two files are created: bandwidth.json and bandwidth.txt. bandwidth.json can be used for automatic processing of the results, which is out of the scope of this document. In bandwidth.txt, see the last 3 lines of text, which look like:

Minimum bandwidth: 24869.6 MB/s between node14 and node28
Maximum bandwidth: 25208.7 MB/s between node02 and node13
Average bandwidth: 25002.5 MB/s

The results are in decimal megabytes per second (10^6 bytes per second).

Results Verification

Your cluster's performance is satisfactory when the minimum achieved result is at least 95% of the maximum available bandwidth, as illustrated in the table below.

For your convenience, the technology of your cluster interconnect is shown in the header of the bandwidth.txt file.

Expected InfiniBand Performance (for 4x Connections)

| Technology | Speed, Gb/s | 95% Performance, MB/s |
| --- | --- | --- |
| EDR | 100 | 11,515 |
| HDR | 200 | 23,030 |
| NDR | 400 | 46,060 |
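
These figures are consistent with the nominal link speed less the 64b/66b encoding overhead: for example, for NDR, 400 Gb/s ÷ 8 × 64/66 ≈ 48,485 MB/s of payload bandwidth, and 95% of that is approximately 46,060 MB/s, the value shown in the table.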

Once the UFM examines the behavior of subnet nodes, including switches and hosts, and identifies a node as "unhealthy" based on internal conditions, this node is displayed in the "Unhealthy Ports" list. Once a node is declared as "unhealthy," the Subnet Manager either ignores, reports, isolates, or disables the node. Users hold the authority to control the executed actions and the criteria that categorize a node as "unhealthy." Furthermore, the user can "clear" nodes previously labeled as "unhealthy".
To navigate through these functionalities and review all unhealthy nodes using the Web User Interface, refer to the Unhealthy Ports Window. Alternatively, use the REST API from a remote node via the Unhealthy Ports REST API.

Since InfiniBand is lossless, the network does not drop packets, which may lead to network congestion. The metric XmitWaitPerc provides the percentage of time in which ports had data to send but could not make progress due to congestion. This metric can be obtained per topology layer (distance from the source hosts towards the destination hosts) or for each link separately.
If a switch port connected to a host shows XmitWaitPerc > 5%, then the most probable cause is that the host PCIe or its memory is not healthy.
If XmitWaitPerc > 5% on links or layers that are not driving a host, that is most probably caused by traffic that exceeds the capacity of that layer. This is normal for over-subscribed networks, where the total number of cables connecting the switches of that layer to the next one is smaller than the number of cables connected to previous layers. If the network is not over-subscribed, however, a high XmitWaitPerc can be a strong sign that adaptive routing is not used, that the applications use many-to-one traffic patterns, or that missing (unhealthy) links make a specific switch over-subscribed.
For more information, refer to "Top X Telemetry Sessions REST API" under Telemetry REST API.

The monitoring system includes UFM Telemetry and optional streaming of its results into customer specific Network Management Systems. UFM Telemetry is responsible for collecting the vital networking metrics and forwarding them to a data-lake or other customer data analysis tools.

  • A comprehensive array of plugins, scripts, recipes and tools to facilitate UFM integration with third-party network management systems can be accessed via a publicly available GitHub repository: UFM SDK Repository.

  • Furthermore, the UFM Software Development Kit (SDK) allows extension of the capabilities of the UFM platform with additional tools.


The following links provide instructions detailing the installation and usage of the Telemetry Streaming/Forwarding plugins:

  • For a catalog of the validated configuration products and their respective versions that have been rigorously tested together and endorsed by NVIDIA, please refer to Quantum-2 Clusters. This page provides the exact version per InfiniBand product (e.g., Switch, HCA, UFM, Transceiver, etc.), that was released and tested as a bundle.

  • In case your cluster is running Long Term Support (LTS) releases, it is recommended to check the NVIDIA LTS web page for the latest LTS release. To gain insights into critical bug resolutions, visit the release notes for each specific product: Long-Term Support (LTS) Releases.

  • Please note that, currently, a complete maintenance window is required for device firmware upgrades.

  • For UFM and OpenSM upgrades, a staged approach can be adopted: begin by upgrading the secondary UFM, transition to it as the master, and subsequently proceed with upgrading the prior master UFM.

  • For cooling system maintenance using the Web UI, run the Check Temperature test from the Fabric Validation tab. For more information, refer to Fabric Validation Tab.

  • For cooling system maintenance using the REST API, issue a POST request using the following URL:
    POST /ufmRest/fabricValidation/tests/CheckTemperature
    For more information, refer to Fabric Validation Tests REST API.

© Copyright 2023, NVIDIA. Last updated on Mar 20, 2024.