NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.1.0

Known Issues

Internal Reference Number

Issues

3230585

Description: When operating in Dynamic trees mode, ibdiagnet may print warning messages about the existence of multiple distinct trees with the same tree ID. In Dynamic trees mode, this is a valid situation, and these warnings should be ignored.

Warning example:

-W- <> - In Node <> found root tree (parent qpn <>) which is already exists for treeID: <>

Workaround: N/A

Keywords: Dynamic tree; ibdiagnet

Discovered in Version: 3.1.0
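
For sites that post-process ibdiagnet output, the warning from issue 3230585 can be filtered out before the remaining warnings are reviewed. The following is a minimal sketch only. It assumes the output has already been read into a list of lines and that the `<>` placeholders in the warning example may hold arbitrary text. The function names are hypothetical and are not part of ibdiagnet or SHARP.

```python
import re

# Pattern built from the warning example above. The "<>" placeholders are
# matched loosely because their exact format is not specified here.
DUPLICATE_TREE_WARNING = re.compile(
    r"^-W- .* - In Node .* found root tree \(parent qpn .*\) "
    r"which is already exists for treeID: .*$"
)


def is_duplicate_tree_warning(line: str) -> bool:
    """Return True if the line is the duplicate tree ID warning."""
    return DUPLICATE_TREE_WARNING.match(line.strip()) is not None


def drop_duplicate_tree_warnings(lines):
    """Return the lines with the duplicate tree ID warnings removed."""
    return [line for line in lines if not is_duplicate_tree_warning(line)]
```

Apply this filter only when sharp_am is operating in Dynamic trees mode. In other modes the same warning may indicate a real problem.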

3209930

Description: Switch firmware miscalculates the required timeout interval for acks to be returned for data messages. This can lead to false alerts about bad connections, resulting in traps being sent and jobs stopped midway.

The following sharp_am configuration settings correct the timeout values on the switches, and using them is recommended:

ib_sat_qpc_local_ack_timeout = 0x19
ib_qpc_local_ack_timeout = 0x19

Workaround: N/A

Keywords: Switch firmware; ack; timeout

Discovered in Version: 3.1.0
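
To confirm that both timeout settings from issue 3209930 are present, the sharp_am configuration text can be checked with a short script. The following is a sketch under stated assumptions: the configuration is passed in as a string, settings use the `key = value` form shown above, and `#` starts a comment. The function names are hypothetical and are not part of SHARP.

```python
# Settings named in issue 3209930 and the value they should carry.
REQUIRED_TIMEOUTS = {
    "ib_sat_qpc_local_ack_timeout": 0x19,
    "ib_qpc_local_ack_timeout": 0x19,
}


def parse_settings(text: str) -> dict:
    """Parse 'key = value' lines (assumed format), ignoring '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        settings[key] = value
    return settings


def missing_or_wrong_timeouts(text: str) -> list:
    """Return the names of required settings that are absent or not 0x19."""
    settings = parse_settings(text)
    problems = []
    for key, expected in REQUIRED_TIMEOUTS.items():
        value = settings.get(key)
        try:
            # int(value, 0) accepts both hexadecimal (0x19) and decimal (25).
            ok = value is not None and int(value, 0) == expected
        except ValueError:
            ok = False
        if not ok:
            problems.append(key)
    return problems
```

An empty list means both settings are present with the expected value.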

3225401

Description: The dynamic tree creation feature does not support the case in which all root switches go down and are restarted. If this happens, restart sharp_am once the root switches are up and running.

Workaround: N/A

Keywords: Aggregation Manager; sharp_am; dynamic trees

Discovered in Version: 3.1.0

3237831

Description: SHARP does not support reassignment of LID values.
If LIDs must be reassigned, stop all SHARP jobs, reassign the LIDs via OpenSM, and restart sharp_am once the reassignment is done.

Workaround: N/A

Keywords: Aggregation Manager; OpenSM

Discovered in Version: 3.1.0

3226743

Description: When the management host is not connected to a leaf switch, sharp_am might print the following warnings:
Ignoring span_all_agg_nodes option parameter value. Spanning all agg nodes is supported only on tree topology.
Tree id: <id> does not spans over aggregation node: Mellanox Technologies Aggregation Node GUID:<management host guid>

The reason for these warnings is that the management host is treated as a potential compute host and it cannot be reached by all SHARP trees, unlike all other compute hosts that are connected to the leaf switches.
As long as the mentioned GUID is of the management host, these warning messages can be ignored.

Workaround: N/A

Keywords: Aggregation Manager; sharp_am; leaf; GUID

Discovered in Version: 3.1.0

3236363

Description: A physical link failure between switches while a SHARP job is running and utilizing the link can cause one of the switches to become invalid for further SHARP jobs, resulting in either "No resource" response for new SHARP job requests, or in jobs getting stuck.

Workaround: If a SHARP job request receives a "No resource" response, check whether the described scenario is the cause by searching the sharp_am log for a message such as:

[error] AN Mellanox Technologies Aggregation Node GUID:<> (LID: <>) responded with status <Not zero> to ResourceCleanup(Clean Job tree) - job_id_sharp: <>, tree_id: <>

If such an error appears in the log, restart sharp_am to clear this status and re-enable SHARP jobs on the invalid switch.

Keywords: Aggregation Manager; sharp_am; Link Failure

Discovered in Release: 3.1.0
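
The log check described in the workaround for issue 3236363 can be automated. The following is a minimal sketch, assuming the log has been read into a list of lines and that the `<>` placeholders in the message hold values without spaces. The function names are hypothetical, and the log file location is deliberately left to the caller because it depends on the installation.

```python
import re

# Pattern built from the error message quoted in the workaround above.
CLEANUP_ERROR = re.compile(
    r"\[error\].*GUID:\s*(?P<guid>\S+) \(LID: (?P<lid>[^)]*)\) "
    r"responded with status (?P<status>\S+) to ResourceCleanup"
)


def _is_nonzero(status: str) -> bool:
    """Treat a status as non-zero unless it clearly parses to 0."""
    try:
        return int(status, 0) != 0
    except ValueError:
        # Non-numeric status text: report it rather than hide it.
        return True


def find_failed_cleanups(lines):
    """Return (guid, lid) pairs for nodes that failed ResourceCleanup."""
    failures = []
    for line in lines:
        match = CLEANUP_ERROR.search(line)
        if match and _is_nonzero(match.group("status")):
            failures.append((match.group("guid"), match.group("lid").strip()))
    return failures
```

If the returned list is not empty, restarting sharp_am as described in the workaround applies.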

3048427

Description: If a switch's split mode is changed (off/on), sharp_am does not handle the new number of supported ports unless it is restarted.

Workaround: Restart sharp_am after changing a switch split mode definition.

Keywords: Aggregation Manager; split mode

Discovered in Release: 2.7.0

3051699

Description: Changing the configuration of SHARP switch ports using device_configuration_file does not take effect on disconnected split ports. If these ports are connected later, they will remain with their default configuration.

Workaround: If the new configuration is desired for the split ports, make sure to restart the Aggregation Manager after connecting a split port to a host.

Keywords: Aggregation Manager; split port

Discovered in Release: 2.7.0

3051924

Description: Adding or replacing non-leaf switches is currently not supported by Aggregation Manager for tree topologies (Fat-Tree, Quasi-Fat-Tree) and Dragonfly+ topologies.

Workaround: Restart Aggregation Manager after the Subnet Manager completes the fabric reconfiguration that follows the fabric changes.

Keywords: Fabric extension; Aggregation Manager; AM

Discovered in Release: 2.7.0

-

Description: In a multi-PKEY environment, UCX in SHARP can use only the default PKEY (PKEY at index 0).

Workaround: Use sockets for communication over a non-default PKEY.

Keywords: Configuration, SMX, UCX, PKEY

Discovered in Release: 2.4.3

1307124

Description: Begin Job requests with virtual ports might be rejected until the fabric virtualization info file is parsed.

Workaround: Wait for AM to discover virtual ports before sending Begin Job requests.

Keywords: Aggregation Manager, Socket Direct, Virtual Ports

Discovered in Release: 1.5.3

1193629

Description: Configuring sharpd/sharp_am as daemons is not possible when installing from RPM into a non-default location.

Workaround: Configure the daemons manually.

Keywords: Configuration

Discovered in Release: 1.5.3

1307108

Description: Discovering a new Aggregation Node (AN) found on the shortest path between two ANs might invalidate the existing path.

Workaround: Restart Aggregation Manager after the Subnet Manager completes the fabric reconfiguration that follows the fabric changes.

Keywords: Aggregation Manager, Aggregation Node

Discovered in Release: 1.5.3

-

Description: Adding or replacing switches is currently not supported by the Aggregation Manager for Hypercube and Dragonfly+ topologies.

Workaround: Restart Aggregation Manager after the Subnet Manager completes the fabric reconfiguration that follows the fabric changes.

Keywords: Fabric extension, Aggregation Manager

Discovered in Release: 1.5.3

-

Description: Adding or replacing non-root switches is currently not supported by the Aggregation Manager for tree topologies (Fat-Tree, Quasi-Fat-Tree).

Workaround: Restart Aggregation Manager after the Subnet Manager completes the fabric reconfiguration that follows the fabric changes.

Keywords: Fabric extension, Aggregation Manager

-

Description: Aggregation Manager High Availability is currently not supported in HPCX/MLNX OFED packages. Therefore, only a single instance of Aggregation Manager can run in the IB fabric.

Workaround: Use Aggregation Manager in UFM.

Keywords: Aggregation Manager

-

Description: Aggregation Manager should run on the same host where the Master Subnet Manager (SM) is running.

Workaround: N/A

Keywords: Aggregation Manager

-

Description: With HPCX/MLNX OFED packages, upon Subnet Manager handover/failover, another instance of Aggregation Manager should be started on the host where the new Master SM is running.

Workaround: Use Aggregation Manager in UFM.

Keywords: Aggregation Manager

-

Description: Aggregation Manager should be started after completion of fabric configuration by the Subnet Manager.

Workaround: N/A

Keywords: Aggregation Manager

-

Description: Only Fat-Tree, Quasi-Fat-Tree, Hypercube and Dragonfly+ topologies are supported by the Aggregation Manager.

Workaround: N/A

Keywords: Fabric Topology

-

Description: The Aggregation Manager supports only IB fabrics in which all compute nodes are connected to Mellanox SHARP-capable switches (Switch-IB 2).

Workaround: Manually configure mapping between the compute port and the Aggregation Node.

Keywords: Fabric Topology

-

Description: When configuration file parameters other than those in 3.3 are changed, Aggregation Manager should be restarted to apply the new configuration.

Workaround: N/A

Keywords: Configuration


© Copyright 2023, NVIDIA. Last updated on May 23, 2023.