NVIDIA SHARP Installation
NVIDIA SHARP consists of two main components: sharp_am, which runs in UFM, and libsharp, which is linked to client applications running on the compute nodes.
The recommended setup is to enable SHARP in UFM and install libsharp on the compute nodes using HPC-X.
Alternatively, SHARP can be installed via DOCA-Host; this approach is detailed in Appendix E: Using NVIDIA SHARP from DOCA-Host.
When using UFM Enterprise, modify the configuration file conf/gv.cfg and set the sharp_enabled parameter to true:
[sharp]
sharp_enabled = true
When using the UFM Appliance, you can also run the command lib sharp enable, which applies the same configuration change.
In both UFM Enterprise and UFM Appliance, a UFM restart is required after making the change.
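For unattended setups, the edit above can be scripted. The following is a minimal sketch, assuming gv.cfg already contains a [sharp] section with a sharp_enabled entry; the path is illustrative (on a real system, edit conf/gv.cfg under the UFM installation in place):

```shell
# Illustrative copy of UFM's gv.cfg; on a real system edit conf/gv.cfg in place.
CFG=/tmp/gv.cfg
printf '[sharp]\nsharp_enabled = false\n' > "$CFG"

# Flip sharp_enabled to true without touching other settings.
sed -i 's/^sharp_enabled *=.*/sharp_enabled = true/' "$CFG"

grep '^sharp_enabled' "$CFG"
```

As noted above, UFM must still be restarted afterward for the change to take effect.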
SHARP Network Settings in UFM
By default, sharp_am communicates with libsharp clients over IP over InfiniBand (IPoIB) using TCP port 6126.
To enable pure InfiniBand-based communication via UCX, edit the SHARP configuration file at conf/sharp/sharp_am.cfg and set the smx_protocol parameter to 1. A restart of sharp_am is required for the change to take effect.
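After the edit, the relevant entry in conf/sharp/sharp_am.cfg looks like the following (a sketch; other settings in the file are unaffected):

```ini
# conf/sharp/sharp_am.cfg
# Select UCX-based (pure InfiniBand) transport; the default transport is TCP over IPoIB.
smx_protocol = 1
```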
UFM Appliance Firewall Settings
UFM Appliance Gen 3.x includes a firewall that, by default, blocks the TCP port used by sharp_am, preventing SHARP clients from connecting when TCP is used.
To allow this communication, open the required port by running:
ufw allow 6126/tcp
The libsharp component (SHARP client) is installed on compute nodes using the HPC-X package.
Refer to the HPC-X User Manual for installation and configuration instructions, and ensure that HPC-X is installed in a shared directory accessible to all compute nodes.
For usage details after installation, see the Using SHARP section.
The HPC-X packages are available for download from the NVIDIA HPC-X product page.
While ibdiagnet can be used to check for general fabric errors, it does not verify communication between libsharp and sharp_am, nor whether SHARP jobs can be successfully created.
To test the SHARP installation and configuration, two test applications are available in the sharp/bin directory. Both are designed to verify client-side behavior:

- sharp_hello – Verifies that a SHARP client can communicate with sharp_am and successfully create a SHARP job. The test creates a simple SHARP tree involving only one compute node. It checks job creation only and does not perform data aggregation.
- sharp_coll_test – Performs a more comprehensive test. It measures bandwidth between the node's HCA and its connected switch, and also verifies SHARP data transfer. This makes it a more thorough diagnostic than sharp_hello, as it checks both job creation and data aggregation.
Running Tests from the UFM Machine
Although it is recommended to run SHARP tests from a compute node, testing from the UFM machine is possible with a configuration change.
By default, sharp_am blocks libsharp from running on management nodes (including the active and standby UFM machines). To allow testing from the UFM node:
1. Edit the configuration file conf/sharp/sharp_am.cfg.
2. Set the following parameter:
   ignore_sm_guids = False
3. Restart sharp_am for the changes to take effect.
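The steps above can be scripted so that the parameter is updated if already present and appended otherwise. This is a sketch against an illustrative file path; on the UFM machine, edit conf/sharp/sharp_am.cfg in place:

```shell
# Illustrative copy; on the UFM machine edit conf/sharp/sharp_am.cfg in place.
CFG=/tmp/sharp_am.cfg
printf 'smx_protocol = 0\n' > "$CFG"

# Set ignore_sm_guids = False: update the line if present, append it otherwise.
if grep -q '^ignore_sm_guids' "$CFG"; then
  sed -i 's/^ignore_sm_guids *=.*/ignore_sm_guids = False/' "$CFG"
else
  printf 'ignore_sm_guids = False\n' >> "$CFG"
fi

grep '^ignore_sm_guids' "$CFG"
```

The update-or-append pattern keeps the script safe to re-run; sharp_am must still be restarted afterward.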
NVIDIA SHARP Hello
The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to a SHARP aggregation node.
Help
$ sharp_hello -h
usage: sharp_hello <-d | --ib_dev> <device> [OPTIONS]
OPTIONS:
[-d | --ib_dev] - HCA to use
[-v | --verbose] - libsharp coll verbosity level(default:2)
Levels: (0-fatal 1-err 2-warn 3-info 4-debug 5-trace)
[-V | --version] - print program version
[-h | --help] - show this usage
Example #1
$ sharp_hello -d mlx5_0:1 -v 3
[thor001:0:15042 - context.c:581] INFO job (ID: 12159720107860141553) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor001:0:15042 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor001:0:15042 - comm.c:393] INFO [group#:0] group id:a tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x3f020000000a) mlid:c007
Test Passed.
Example #2
$ SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3
[swx-dgx01:0:59023 - context.c:581] INFO job (ID: 15134963379905498623) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[swx-dgx01:0:59023 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[swx-dgx01:0:59023 - context.c:755] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:0] group id:3c tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0xd6060000003c) mlid:c004
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:1] group id:3c tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed