NVIDIA SHARP Installation
NVIDIA SHARP consists of two main components: sharp_am, which runs in UFM, and libsharp, which is linked to client applications running on the compute nodes.
The recommended setup is to enable SHARP in UFM and install libsharp on the compute nodes using HPC-X.
Alternatively, SHARP can be installed via DOCA-Host; this approach is detailed in Appendix E: Using NVIDIA SHARP from DOCA-Host.
When using UFM Enterprise, modify the configuration file conf/gv.cfg and set the sharp_enabled parameter to true:
[sharp]
sharp_enabled = true
When using the UFM Appliance, you can also run the command lib sharp enable, which applies the same configuration change.
In both UFM Enterprise and UFM Appliance, a UFM restart is required after making the change.
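For unattended setups, the edit above can be scripted. The following is a minimal sketch, assuming gv.cfg already contains a [sharp] section with a sharp_enabled entry; the path is illustrative (on a real system, edit conf/gv.cfg under the UFM installation in place):

```shell
# Illustrative copy of UFM's gv.cfg; on a real system edit conf/gv.cfg in place.
CFG=/tmp/gv.cfg
printf '[sharp]\nsharp_enabled = false\n' > "$CFG"

# Flip sharp_enabled to true without touching other settings.
sed -i 's/^sharp_enabled *=.*/sharp_enabled = true/' "$CFG"

grep '^sharp_enabled' "$CFG"
```

As noted above, UFM must still be restarted afterward for the change to take effect.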
SHARP Network Settings in UFM
By default, sharp_am communicates with libsharp clients over IP over InfiniBand (IPoIB) using TCP port 6126.
To enable pure InfiniBand-based communication via UCX, edit the SHARP configuration file at conf/sharp/sharp_am.cfg and set the smx_protocol parameter to 1. A restart of sharp_am is required for the change to take effect.
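After the edit, the relevant entry in conf/sharp/sharp_am.cfg looks like the following (a sketch; other settings in the file are unaffected):

```ini
# conf/sharp/sharp_am.cfg
# Select UCX-based (pure InfiniBand) transport; the default transport is TCP over IPoIB.
smx_protocol = 1
```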
UFM Appliance Firewall Settings
UFM Appliance Gen 3.x includes a firewall that, by default, blocks the TCP port used by sharp_am, preventing SHARP clients from connecting when TCP is used.
To allow this communication, open the required port by running:
ufw allow 6126/tcp
The libsharp component (SHARP client) is installed on compute nodes using the HPC-X package.
Refer to the HPC-X User Manual for installation and configuration instructions, and ensure that HPC-X is installed in a shared directory accessible to all compute nodes.
For usage details after installation, see the Using SHARP section.
The HPC-X packages are available for download from the NVIDIA HPC-X product page.
While ibdiagnet can be used to check for general fabric errors, it does not verify communication between libsharp and sharp_am, nor whether SHARP jobs can be successfully created.
To test the SHARP installation and configuration, two test applications are available in the sharp/bin directory. Both are designed to verify client-side behavior:

- sharp_hello – Verifies that a SHARP client can communicate with sharp_am and successfully create a SHARP job. The test creates a simple SHARP tree involving only one compute node. It checks job creation only and does not perform data aggregation.
- sharp_coll_test – Performs a more comprehensive test. It measures bandwidth between the node's HCA and its connected switch, and also verifies SHARP data transfer. This makes it a more thorough diagnostic than sharp_hello, as it checks both job creation and data aggregation.
Running Tests from the UFM Machine
Although it is recommended to run SHARP tests from a compute node, testing from the UFM machine is possible with a configuration change.
By default, sharp_am blocks libsharp from running on management nodes (including the active and standby UFM machines). To allow testing from the UFM node:
1. Edit the configuration file conf/sharp/sharp_am.cfg.
2. Set the following parameter:
   ignore_sm_guids = False
3. Restart sharp_am for the changes to take effect.
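The steps above can be scripted so that the parameter is updated if already present and appended otherwise. This is a sketch against an illustrative file path; on the UFM machine, edit conf/sharp/sharp_am.cfg in place:

```shell
# Illustrative copy; on the UFM machine edit conf/sharp/sharp_am.cfg in place.
CFG=/tmp/sharp_am.cfg
printf 'smx_protocol = 0\n' > "$CFG"

# Set ignore_sm_guids = False: update the line if present, append it otherwise.
if grep -q '^ignore_sm_guids' "$CFG"; then
  sed -i 's/^ignore_sm_guids *=.*/ignore_sm_guids = False/' "$CFG"
else
  printf 'ignore_sm_guids = False\n' >> "$CFG"
fi

grep '^ignore_sm_guids' "$CFG"
```

The update-or-append pattern keeps the script safe to re-run; sharp_am must still be restarted afterward.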
NVIDIA SHARP Hello
The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to a SHARP aggregation node.
Help
$ sharp_hello -h
usage: sharp_hello <-d | --ib_dev> <device> [OPTIONS]
OPTIONS:
[-d | --ib_dev] - HCA to use
[-v | --verbose] - libsharp coll verbosity level(default:2)
Levels: (0-fatal 1-err 2-warn 3-info 4-debug 5-trace)
[-V | --version] - print program version
[-h | --help] - show this usage
Example #1
$ sharp_hello -d mlx5_0:1 -v 3
[thor001:0:15042 - context.c:581] INFO job (ID: 12159720107860141553) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor001:0:15042 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor001:0:15042 - comm.c:393] INFO [group#:0] group id:a tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x3f020000000a) mlid:c007
Test Passed.
Example #2
$ SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3
[swx-dgx01:0:59023 - context.c:581] INFO job (ID: 15134963379905498623) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[swx-dgx01:0:59023 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[swx-dgx01:0:59023 - context.c:755] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:0] group id:3c tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0xd6060000003c) mlid:c004
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:1] group id:3c tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed