sharp_am Network Interfaces

NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) Rev 3.4.0

sharp_am communicates with the following entities:

  • IB switches - sharp_am sends MADs to get status and configure the switches for SHARP activities.

    MAD communication with the IB switches takes place over the IB network.

  • libsharp - rank 0 of a collective operation sends SHARP job requests to sharp_am and receives instructions from it.

    Communication with libsharp uses a proprietary binary protocol called smx. The smx transport layer can run over IB using UCX, or over sockets (Ethernet).

  • UFM - when sharp_am operates inside UFM, UFM passes information and configuration commands to sharp_am.

    Communication with UFM also uses the proprietary smx protocol, but its transport layer is a Unix socket.

By default, sharp_am uses the same IB interface as OpenSM for both MAD and libsharp communication.

By default, communication with libsharp uses the socket (Ethernet) transport.

By default, a Unix socket is kept open for communication with UFM.

The following configuration parameters select specific interfaces and change the communication protocol:

| Parameter | Component | Description |
|---|---|---|
| ib_port_guid | sharp_am | Sets the GUID of the port to which sharp_am binds for all MAD communication with the switches. A value of 0 means to use the same port that OpenSM uses. Default value: 0. |
| smx_enabled_protocols | sharp_am | A bitmask specifying which transport layers are enabled for smx communication. Multiple options can be combined. Bit 1 (value 1): UCX. Bit 2 (value 2): Sockets. Bit 3 (value 4): Unix sockets (needed for UFM). Default value: 6, which means Sockets and Unix sockets. |
| smx_protocol | sharp_am | Defines the default protocol used when communicating with libsharp. Value 1: UCX. Value 2: Sockets. Default value: 2, which means Sockets. |
| smx_sock_interface | sharp_am | Relevant only when the smx socket transport is enabled. Sets the interface, specified by name, that smx uses for socket connections. An empty value means to use the same interface as OpenSM, in this case using IP-over-IB. When sharp_am operates inside UFM, UFM sets this parameter automatically according to its internal logic and the am_interface parameter in the gv.cfg file. Default value: Empty. |
| smx_sock_port | sharp_am | Relevant only when the smx socket transport is enabled. IP port number used for the socket listener. Default value: 6126. |
| smx_ucx_interface | sharp_am | Relevant only when the smx UCX transport is enabled. Sets the interface, specified by name, that smx uses for UCX connections. An empty value means to use the same interface as OpenSM. Default value: Empty. |
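The smx_enabled_protocols bitmask can be easier to read once decoded. The following is a minimal Python sketch, not part of SHARP, that maps the documented bit values to transport names. The names SmxProtocol, enabled_transports, and protocol_is_enabled are hypothetical helpers introduced for illustration, and the check that smx_protocol should be one of the enabled transports is an assumption inferred from the parameter descriptions.

```python
from enum import IntFlag


class SmxProtocol(IntFlag):
    """Bit values documented for smx_enabled_protocols."""
    UCX = 1
    SOCKETS = 2
    UNIX_SOCKETS = 4


def enabled_transports(mask: int) -> list[str]:
    """Return the names of the transports enabled by the given bitmask."""
    return [p.name for p in SmxProtocol if mask & p]


def protocol_is_enabled(smx_protocol: int, mask: int) -> bool:
    """Assumed sanity check: the default libsharp protocol (1 = UCX,
    2 = Sockets) should also be enabled in the bitmask."""
    return smx_protocol in (SmxProtocol.UCX, SmxProtocol.SOCKETS) and bool(mask & smx_protocol)


# Documented defaults: smx_enabled_protocols = 6, smx_protocol = 2
print(enabled_transports(6))      # ['SOCKETS', 'UNIX_SOCKETS']
print(protocol_is_enabled(2, 6))  # True
print(protocol_is_enabled(1, 6))  # False: UCX is not enabled in mask 6
```

Under these assumptions, enabling UCX alongside the defaults would correspond to a mask of 7 (1 + 2 + 4).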

If the management host has multiple network interfaces, sharp_am can operate in high-availability (HA) mode, automatically handling network interface failures and switching to an active interface without interrupting any activity.

HA for the IB transport is handled by sharp_am itself, while HA for the Ethernet transport is handled by IP bonding.

Warning

If a network failure occurs while a new job is being established, that operation fails. However, subsequent job requests are not affected, and ongoing jobs continue to operate as usual.

HA Configuration

  1. ib_port_guid should be set to 0 (the default), so that sharp_am chooses which port to use.

  2. allow_remote_sm should be set to False (the default). HA of the IB ports works only when sharp_am resides on the same machine as OpenSM.

  3. If smx UCX is enabled, smx_ucx_interface should be empty (the default), so that sharp_am chooses which interface to use.

  4. If smx sockets are enabled, IP bonding should be configured on the management host and smx_sock_interface should be set to the bond interface.
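The four HA requirements above can be expressed as a small consistency check. The following Python sketch is illustrative only: it assumes the parameters have already been loaded into a dictionary of native Python values (how they are read from the sharp_am configuration is not covered here), and check_ha_settings, bond0, and mlx5_0:1 are hypothetical names. It does not verify that the named interface is actually a bond.

```python
def check_ha_settings(cfg: dict) -> list[str]:
    """Return a list of settings that conflict with the HA guidance above.

    Missing keys fall back to the documented defaults.
    """
    problems = []
    mask = cfg.get("smx_enabled_protocols", 6)  # documented default: 6

    # 1. ib_port_guid must stay 0 so sharp_am can choose the port.
    if cfg.get("ib_port_guid", 0) != 0:
        problems.append("ib_port_guid should be 0")

    # 2. allow_remote_sm must be False.
    if cfg.get("allow_remote_sm", False):
        problems.append("allow_remote_sm should be False")

    # 3. With UCX enabled (value 1), smx_ucx_interface must be empty.
    if mask & 1 and cfg.get("smx_ucx_interface", ""):
        problems.append("smx_ucx_interface should be empty when UCX is enabled")

    # 4. With sockets enabled (value 2), smx_sock_interface must name the bond.
    if mask & 2 and not cfg.get("smx_sock_interface", ""):
        problems.append("smx_sock_interface should name the bond interface")

    return problems


# Defaults plus a (hypothetical) bond interface name: no conflicts reported.
print(check_ha_settings({"smx_sock_interface": "bond0"}))  # []

# A fixed port GUID, a pinned UCX interface, and no bond interface.
for problem in check_ha_settings({
    "ib_port_guid": 0x1234,
    "smx_enabled_protocols": 7,
    "smx_ucx_interface": "mlx5_0:1",
}):
    print(problem)
```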

UFM Appliance Gen 3.x uses a firewall that, by default, blocks the TCP port used by sharp_am, preventing SHARP clients from communicating with sharp_am. To use UFM Appliance Gen 3.x with SHARP, open the required TCP port by running ufw allow 6126/tcp. Make sure that the port allowed through the firewall matches the value of the smx_sock_port configuration parameter.
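After opening the port, a client host can confirm that the sharp_am socket listener is reachable. The following Python sketch is a generic TCP connectivity probe, not a SHARP tool: tcp_port_reachable is a hypothetical helper, and a successful connection only shows that something accepts TCP connections on that port, not that the smx protocol is working.

```python
import socket


def tcp_port_reachable(host: str, port: int = 6126, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    The default port matches the documented smx_sock_port default (6126).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, calling tcp_port_reachable with the management host's address from a compute node would be expected to return False while the firewall blocks the port and True once it is allowed, assuming sharp_am is running with the socket transport enabled.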

© Copyright 2023, NVIDIA. Last updated on Nov 7, 2023.