This document describes guidelines and working methods for NDR clusters and later generations.
Confirm the operational status of the UFM service.
If you are using UFM Enterprise Appliance, execute the following command via the command-line interface (CLI) after logging into the UFM appliance.
show ufm status
For more information, refer to UFM General Commands.
If you are using your own server, refer to Showing UFM Processes Status.
If you prefer using the web user interface:
Navigate to the "System Health" tab in the left menu.
Under the "UFM Health" section, click on "Create New Report."
Confirm that all fields are displaying green indicators.
For detailed instructions, refer to the UFM Health Tab.
It is also recommended to conduct a remote test of the REST API by querying the "UFM Health" report. For instructions, refer to the Reports REST API.
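As a sketch, such a remote REST API check can be scripted. The `/reports` path, IP address, and credentials below are illustrative placeholders, not confirmed endpoint details; the exact endpoint is documented in the Reports REST API reference.

```python
import base64
import urllib.request

def ufm_request(ufm_ip: str, path: str, user: str, password: str) -> urllib.request.Request:
    """Build an authenticated request to the UFM REST API.

    The path used here is illustrative; see the Reports REST API
    documentation for the exact endpoint.
    """
    url = f"https://{ufm_ip}/ufmRest{path}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})

# Hypothetical IP and credentials for illustration only:
req = ufm_request("10.0.0.1", "/reports", "admin", "password")
# urllib.request.urlopen(req)  # would fetch the report data from a live UFM
```

Sending the request from a node other than the UFM server itself also verifies that the REST port is reachable through the network.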
To generate a fabric health report and verify that all sections are green, perform the following steps in the Web UI:
Access the "System Health" tab on the left menu
Click on "Run New Report" under the "Fabric Health" section
Confirm that all fields are indicating green status
For detailed instructions, refer to the Fabric Health Tab.
Additionally, within the "System Health" tab:
Run the available tests under "Fabric Validation"
Verify the outcomes as either "Pass" or "Completed with No Errors"
For detailed instructions, refer to the Fabric Validation Tab.
Furthermore, it is recommended to run REST API tests from a remote node, using the REST APIs described in the following links:
Once the InfiniBand cluster is built, it is essential to create a Master Topology. This Master Topology serves as a reference during cluster operation, enabling the detection of any network configuration changes. It is noteworthy that the actual cluster topology may be different from the initially planned specifications. Detecting and validating these discrepancies in topology is crucial to ensure the cluster's proper functionality.
As an example, even when a known ToR switch is defective due to a hardware malfunction and is scheduled for the RMA process, the cluster can still operate, albeit with some degradation in performance and capacity.
For more comprehensive details, refer to Topology Compare REST API.
To collect InfiniBand port, PHY, and cable telemetry metrics, perform the following.
Access the embedded UFM Telemetry instance through an HTTP endpoint by entering the following URL in your browser address bar:
http://$ufm_ip$:9001/labels/enterprise
Remember to replace "$ufm_ip$" with your actual UFM IP address.
Expected Results:
PortXmitDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390
PortRcvDataExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390
PortXmitPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390
PortRcvPktsExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 791707 1648987628390
SymbolErrorCounterExtended{device_name="",device_type="host",fabric="compute",hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",peer_level="server",port_id="248a0703009a15fa_1"} 0 1648987628390
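The exposition lines above can be parsed into structured records for a monitoring pipeline. The following sketch handles the sample format shown here; it assumes, as in the samples, that label values contain no commas:

```python
import re

# One metric line from the telemetry endpoint, copied from the expected results above.
SAMPLE = ('PortXmitDataExtended{device_name="",device_type="host",fabric="compute",'
          'hostname="swx-snap3",level="server",node_desc="swx-snap3 mlx5_0",'
          'peer_level="server",port_id="248a0703009a15fa_1"} 228011616 1648987628390')

LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)\s+(?P<ts>\d+)$')

def parse_metric(line: str) -> dict:
    """Split a Prometheus-style line into name, labels, counter value, and timestamp (ms)."""
    m = LINE_RE.match(line)
    pairs = (kv.split("=", 1) for kv in m.group("labels").split(","))
    labels = {k: v.strip('"') for k, v in pairs}
    return {"name": m.group("name"), "labels": labels,
            "value": int(m.group("value")), "ts": int(m.group("ts"))}
```

A record parsed this way can then be keyed by `port_id` to track counter deltas between polls.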
For a more compact CSV data format, access the following endpoint:
http://$ufm_ip$:9001/labels/csv/metrics
Remember to replace "$ufm_ip$" with your actual UFM IP address.
The following table lists the link monitoring key indicators and provides their descriptions, pass/fail criteria and monitoring intervals.
| Parameter | Description | Evaluation Criteria | Monitoring Interval |
| --- | --- | --- | --- |
| **Link State** | | | |
| Phy_state | Physical link state | Verify link is up | Ongoing |
| Logical_state | Logical link state | Verify link is in ACTIVE mode | Ongoing |
| speed_active | Active link speed | Verify expected speed | Ongoing |
| width_active | Active link width | Verify expected width: 4x (or 2x for split cables) | Ongoing |
| **Link Quality** | | | |
| NDR Link Quality | Link quality criteria depend on the error-correction scheme type | Note: minimum port up time for BER measurement is 125 minutes | Ongoing |
| **PHY Errors** | | | |
| Symbol_Errors | Errors remaining after FEC and PLR | Defined by symbol BER | Ongoing |
| Link_Down counter | Total number of link down events that occurred as a result of involuntary link shutdown | If delta from last sample > 0 | Ongoing |
| LinkErrorRecoveryCounter | Number of times the port training state machine has successfully completed the link error recovery process | Clean, no errors | Ongoing |
| Chip temperature | Temperature in °C | If the temperature reaches the maximum threshold, the firmware performs a protective thermal shutdown | Ongoing |
| Device FW version | Switch/HCA firmware version | Verify the approved version is the latest released by NVIDIA; devices across the cluster should run the same version | Days |
| **Cables Information** | | | |
| PN | Part number | No check required | Days |
| SN | Serial number | No check required | Days |
| FW ver | Firmware version | Verify the approved version is the latest released by NVIDIA | Days |
| Module temperature | Optic modules only | Each transceiver has its own warning and alarm thresholds, usually Warning [70°C, 0°C] and Alarm [80°C, -10°C] | Ongoing |
| Rx power / Tx power per lane | Optic modules only | Each transceiver has its own warning and alarm thresholds | Minutes |
| **Packet Discard** | | | |
| PortRcvErrors | Total number of packets containing an error that were received on the port | < 10 per second (over 2 successive samples) | Minutes |
| PortXmitDiscards | Total number of outbound packets discarded because the port is down or congested | < 10 per second (over 2 successive samples) | Minutes |
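As an illustration, the Link_Down and PortRcvErrors criteria from the table can be evaluated in code. The thresholds follow the table; the function names and the alerting interpretation (alert when both successive samples reach the limit) are illustrative:

```python
def link_down_alert(prev: int, curr: int) -> bool:
    """Link_Down counter: any positive delta between successive samples is an alert."""
    return (curr - prev) > 0

def rcv_errors_alert(samples: list, interval_s: float) -> bool:
    """PortRcvErrors: the table requires < 10 errors/second over 2 successive samples.

    `samples` holds successive readings of the cumulative counter taken
    `interval_s` seconds apart; alert when the last two rates are both >= 10/s.
    """
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    rates = [d / interval_s for d in deltas]
    return len(rates) >= 2 and all(r >= 10 for r in rates[-2:])
```

The same delta-based pattern applies to any cumulative counter collected from the telemetry endpoint.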
The tool used for validating cluster performance is known as ClusterKit, an integral component of the HPC-X Software Toolkit.
NVIDIA® HPC-X® presents a comprehensive software bundle encompassing MPI and SHMEM communication libraries. Within this package, various acceleration components are included, enhancing the performance and scalability of applications that operate on top of these libraries. Notably, UCX (Unified Communication X) accelerates the underlying send/receive (or put/get) messages. Also included, HCOLL, which accelerates the underlying collective operations used by the MPI/PGAS languages.
For detailed documentation, along with instructions for downloading and installing HPC-X, refer to HPC-X Documentation.
HPC-X Functionality Verification
To ensure the correct operation of HPC-X, a straightforward MPI test program bundled with HPC-X can be employed. Use the following procedure:
Set the HPCX_HOME environment variable to point to the HPCX installation directory:
% export HPCX_HOME=<HPCX Directory>
Initialize HPC-X environment variables:
% source $HPCX_HOME/hpcx-init.sh
% hpcx_load
Execute the precompiled MPI test program hello_c. The MPI program can be executed using either of the following methods:
Inside a SLURM allocation or job, run:
% mpirun $HPCX_MPI_TESTS_DIR/examples/hello_c
Without SLURM, using SSH and explicitly setting the hosts to run on:
% mpirun --host <host1,host2,…,hostN> $HPCX_MPI_TESTS_DIR/examples/hello_c
Alternatively, you can put all hostnames into a single file (hostfile) and pass that file to mpirun (see mpirun(1) man page for details):
% mpirun --hostfile <hostfile> $HPCX_MPI_TESTS_DIR/examples/hello_c
The output should contain one line for every MPI process that was executed. Each line indicates the MPI rank of the process, the total number of processes, and the version of Open MPI bundled with HPC-X. For instance:
Hello, world, I am 90 of 168, (Open MPI v4.1.5rc2, package: Open MPI root@hpc-kernel-03 Distribution, ident: 4.1.5rc2, repo rev: v4.1.5rc1-16-g5980bac633, Unreleased developer copy, 150)
The number of lines should match the number of cores in the allocation.
Check that the ClusterKit script (clusterkit.sh) is available. Run:
% ls -l $HPCX_CLUSTERKIT_DIR/bin/run_clusterkit.sh
Check that the file $HPCX_CLUSTERKIT_DIR/bin/run_clusterkit.sh exists and is executable.
Running ClusterKit
Prior to executing ClusterKit, it is important to have HPC-X properly set up with initialized environment variables. Additionally, ensure that the ClusterKit script (clusterkit.sh) is accessible, as instructed in the preceding section.
ClusterKit can be run inside SLURM allocation or job or without SLURM. When operating within a SLURM allocation, employ the following command:
$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -d mlx5_4:1 -x "-d bw"
Where -d adapter:port selects which InfiniBand adapter and port to use and -x "-d bw" sets which test to run (bandwidth test).
If running outside SLURM allocation, use:
$HPCX_CLUSTERKIT_DIR/bin/clusterkit.sh -f hostfile -d mlx5_4:1 -x "-d bw"
Where -f hostfile sets hostfile to use. The hostfile contains the list of nodes to use (see mpirun(1) man page for details).
You can add the -D <output dir> switch to set the output directory for the run. Without it, the output is saved into a directory named after the date and time of the run (e.g., 20230731_154932).
Two files are created in the output directory: bandwidth.json and bandwidth.txt. bandwidth.json can be used for automated processing of the results, which is out of scope for this document. The last three lines of bandwidth.txt look like:
Minimum bandwidth: 24869.6 MB/s between node14 and node28
Maximum bandwidth: 25208.7 MB/s between node02 and node13
Average bandwidth: 25002.5 MB/s
The results are in decimal megabytes per second (10^6 bytes per second).
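The summary lines can also be extracted programmatically for an automated pass/fail check. This sketch parses the three lines in the format shown above; the sample figures are the ones from the example output:

```python
import re

# Last three lines of a bandwidth.txt, in the format shown above.
TAIL = """Minimum bandwidth: 24869.6 MB/s between node14 and node28
Maximum bandwidth: 25208.7 MB/s between node02 and node13
Average bandwidth: 25002.5 MB/s"""

def parse_bandwidth_summary(text: str) -> dict:
    """Extract the minimum/maximum/average MB/s figures from bandwidth.txt."""
    out = {}
    for kind in ("Minimum", "Maximum", "Average"):
        m = re.search(rf"{kind} bandwidth:\s*([\d.]+)\s*MB/s", text)
        out[kind.lower()] = float(m.group(1))
    return out
```

The cluster passes when the minimum figure is at least the 95% threshold for the cluster's interconnect technology, taken from the table in the Results Verification section.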
Results Verification
Your cluster's performance is satisfactory when the minimum achieved result is at least 95% of the maximum available bandwidth, as illustrated in the table below.
For your convenience, the technology of your cluster interconnect is shown in the header of the bandwidth.txt file.
Expected InfiniBand Performance (for 4x Connections)
| Technology | Speed, Gb/s | 95% Performance, MB/s |
| --- | --- | --- |
| EDR | 100 | 11,515 |
| HDR | 200 | 23,030 |
| NDR | 400 | 46,060 |
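The table values can be reproduced from the raw line rates. Assuming 64b/66b encoding on these link generations (a factor that reproduces the table figures exactly), the 95% threshold in decimal MB/s is:

```python
def threshold_mb_s(speed_gbps: float, fraction: float = 0.95) -> int:
    """95% of the effective data rate, in decimal MB/s.

    The effective data rate is the signaling rate scaled by the 64b/66b
    encoding overhead; divide by 8 bits per byte and convert Gb to MB
    (decimal, 10^6). The result is truncated, matching the table.
    """
    effective_gbps = speed_gbps * 64 / 66
    return int(effective_gbps * 1000 / 8 * fraction)

for tech, speed in (("EDR", 100), ("HDR", 200), ("NDR", 400)):
    print(tech, threshold_mb_s(speed))
# prints: EDR 11515 / HDR 23030 / NDR 46060
```

This lets a validation script derive the pass threshold from the detected link speed instead of hard-coding the table.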
UFM examines the behavior of subnet nodes, including switches and hosts, and identifies nodes as "unhealthy" based on internal conditions; such nodes are displayed in the "Unhealthy Ports" list. Once a node is declared "unhealthy," the Subnet Manager either ignores, reports, isolates, or disables it. Users control which action is executed and the criteria that categorize a node as "unhealthy." Users can also "clear" nodes previously labeled as "unhealthy."
To review all unhealthy nodes using the Web UI, refer to the Unhealthy Ports Window. Alternatively, use the REST API from a remote node via the Unhealthy Ports REST API.
Since InfiniBand is lossless, the network does not drop packets; instead, congestion can build up. The metric XmitWaitPerc gives the percentage of time during which a port had data to send but could not transmit due to congestion. This metric can be obtained per topology layer (distance from the source hosts towards the destination hosts) or for each link separately.
If a switch port connected to a host shows XmitWaitPerc > 5%, the most probable cause is that the host's PCIe or memory is not healthy.
If XmitWaitPerc > 5% on links or layers that do not drive a host, the most probable cause is traffic that exceeds the capacity of that layer. This is normal for over-subscribed networks, where the total number of cables connecting the switches of that layer to the next one is smaller than the number of cables connected to the previous layers. If the network is not over-subscribed, however, a high XmitWaitPerc can be a strong sign that adaptive routing is not in use, that applications are generating many-to-one traffic patterns, or that missing (unhealthy) links have made a specific switch over-subscribed.
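The triage logic above can be sketched as a small helper. The 5% threshold comes from the text; the function signature and the returned hint strings are illustrative:

```python
def classify_xmit_wait(xmit_wait_perc: float, drives_host: bool,
                       oversubscribed_layer: bool) -> str:
    """Rough triage of a port's XmitWaitPerc following the rules above.

    Returns a suggested next step, not a definitive diagnosis.
    """
    if xmit_wait_perc <= 5.0:
        return "ok"
    if drives_host:
        # Port toward a host: suspect the host side first.
        return "check host PCIe / memory health"
    if oversubscribed_layer:
        # Expected behavior for over-subscribed fabric layers.
        return "expected for an over-subscribed layer"
    return "check adaptive routing, many-to-one traffic, or unhealthy links"
```

In practice the per-port XmitWaitPerc values would come from the telemetry endpoint, and the topology attributes from the Master Topology.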
For more information, refer to "Top X Telemetry Sessions REST API" under Telemetry REST API.
The monitoring system includes UFM Telemetry and optional streaming of its results into customer specific Network Management Systems. UFM Telemetry is responsible for collecting the vital networking metrics and forwarding them to a data-lake or other customer data analysis tools.
A comprehensive array of plugins, scripts, recipes, and tools to facilitate UFM integration with third-party network management systems can be accessed via a publicly available GitHub repository: UFM SDK Repository.
Furthermore, the UFM Software Development Kit (SDK) allows extension of the capabilities of the UFM platform with additional tools.
The following links provide instructions detailing the installation and usage of the Telemetry Streaming/Forwarding plugins:
For a catalog of validated products and versions that have been rigorously tested together and endorsed by NVIDIA, refer to Quantum-2 Clusters. This page provides the exact version of each InfiniBand product (e.g., switch, HCA, UFM, transceiver) that was released and tested as a bundle.
If your cluster is running Long Term Support (LTS) releases, it is recommended to check the NVIDIA LTS web page for the latest LTS release. For insights into critical bug resolutions, visit the release notes of each specific product: Long-Term Support (LTS) Releases.
Please note that, currently, a complete maintenance window is required for device firmware upgrades.
For UFM and OpenSM upgrades, a staged approach can be adopted: begin by upgrading the secondary UFM, transition to it as the master, and subsequently proceed with upgrading the prior master UFM.
For cooling system maintenance using web UI, run the check temperature test from fabric validation tab. For more information, refer to Fabric Validation Tab .
For cooling system maintenance using the REST APIs, issue a POST request using the following URL:
POST /ufmRest/fabricValidation/tests/CheckTemperature
For more information, refer to Fabric Validation Tests REST API.
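As a sketch, the same POST request can be issued programmatically. The endpoint path is taken from the text above; the IP address and credentials are placeholders:

```python
import base64
import urllib.request

def check_temperature_request(ufm_ip: str, user: str, password: str) -> urllib.request.Request:
    """Build the POST request for the CheckTemperature fabric validation test."""
    url = f"https://{ufm_ip}/ufmRest/fabricValidation/tests/CheckTemperature"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(url, method="POST",
                                  headers={"Authorization": f"Basic {token}"})

# Hypothetical IP and credentials for illustration only:
# urllib.request.urlopen(check_temperature_request("10.0.0.1", "admin", "password"))
```

Scheduling this request periodically gives a simple remote check of the cooling system between maintenance windows.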