NVIDIA SHARP Installation
NVIDIA SHARP consists of two main components: sharp_am, which runs in UFM, and libsharp, which is linked to client applications running on the compute nodes.
The recommended setup is to enable SHARP in UFM and install libsharp on the compute nodes using HPC-X.
Alternatively, SHARP can be installed via DOCA-Host; this approach is detailed in Appendix E: Using NVIDIA SHARP from DOCA-Host.
Starting with UFM v6.23.0, sharp_enabled is set to true by default.
Upgrading UFM Enterprise preserves the previous configuration. In this case, verify that the configuration file conf/gv.cfg includes the following setting:
[sharp]
sharp_enabled = true
When using the UFM Appliance, you can also run the command lib sharp enable, which applies the same configuration change.
In both UFM Enterprise and UFM Appliance, a UFM restart is required after making the change.
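The verification step above can be sketched in shell. The configuration path and the sample file below are assumptions for illustration only; on a real system, point the check at the gv.cfg of your UFM installation and restart UFM with your site's service manager.

```shell
# Minimal sketch: confirm gv.cfg enables SHARP before restarting UFM.
# CFG default and the sample file are illustrative assumptions.
CFG="${UFM_GV_CFG:-/tmp/demo_gv.cfg}"

# For illustration only -- a real system already has this file.
printf '[sharp]\nsharp_enabled = true\n' > "$CFG"

# Check the [sharp] section for the expected setting.
if grep -A1 '^\[sharp\]' "$CFG" | grep -q 'sharp_enabled = true'; then
    echo "SHARP enabled"      # safe to proceed with the UFM restart
else
    echo "SHARP disabled"
fi
```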
The libsharp component (SHARP client) is installed on compute nodes using the HPC-X package.
Refer to the HPC-X User Manual for installation and configuration instructions, and ensure that HPC-X is installed in a shared directory accessible to all compute nodes.
For usage details after installation, see the Using SHARP section.
The HPC-X packages are available for download from the NVIDIA HPC-X product page.
While ibdiagnet can be used to check for general fabric errors, it does not verify communication between libsharp and sharp_am, nor whether SHARP jobs can be successfully created.
To test the SHARP installation and configuration, two test applications are available in the sharp/bin directory. Both are designed to verify client-side behavior:
sharp_hello – Verifies that a SHARP client can communicate with sharp_am and successfully create a SHARP job. The test creates a simple SHARP tree involving only one compute node. It checks job creation only and does not perform data aggregation.
sharp_coll_test – Performs a more comprehensive test. It measures bandwidth between the node's HCA and its connected switch, and also verifies SHARP data transfer. This makes it a more thorough diagnostic than sharp_hello, as it checks both job creation and data aggregation.
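Both utilities live under sharp/bin inside the HPC-X installation. The sketch below shows how one might locate them; HPCX_HOME and the demo directory tree are assumptions for illustration, standing in for a real shared HPC-X install.

```shell
# Sketch: locate the SHARP test binaries under a shared HPC-X install.
# HPCX_HOME default and the demo tree are illustrative assumptions.
HPCX_HOME="${HPCX_HOME:-/tmp/hpcx-demo}"

# For illustration only -- stand-in for a real HPC-X installation.
mkdir -p "$HPCX_HOME/sharp/bin"
touch "$HPCX_HOME/sharp/bin/sharp_hello" \
      "$HPCX_HOME/sharp/bin/sharp_coll_test"

# List the test utilities shipped in sharp/bin.
find "$HPCX_HOME/sharp/bin" -type f | sort
```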
Running Tests from the UFM Machine
Although it is recommended to run SHARP tests from a compute node, testing from the UFM machine is possible with a configuration change.
By default, sharp_am blocks libsharp from running on management nodes (including the active and standby UFM machines). To allow testing from the UFM node:
Edit the configuration file:
conf/sharp/sharp_am.cfg
Set the following parameter:
ignore_sm_guids = False
Restart sharp_am for the changes to take effect.
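The edit above can be scripted. In this sketch the configuration path and the sample file are assumptions for illustration; on a real UFM machine, target the actual conf/sharp/sharp_am.cfg and restart sharp_am through your usual service management path.

```shell
# Sketch: flip ignore_sm_guids in sharp_am.cfg (path is an assumption).
CFG="${SHARP_AM_CFG:-/tmp/demo_sharp_am.cfg}"

# For illustration only -- a real system already has this file.
printf 'ignore_sm_guids = True\n' > "$CFG"

# Set the parameter so libsharp may run on the UFM (management) node.
sed -i 's/^ignore_sm_guids = .*/ignore_sm_guids = False/' "$CFG"
grep '^ignore_sm_guids' "$CFG"   # → ignore_sm_guids = False
```

Remember that sharp_am must be restarted before the change takes effect.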
NVIDIA SHARP Hello
The NVIDIA SHARP distribution provides the sharp_hello test utility for testing SHARP's end-to-end functionality on a compute node. It creates a single SHARP job and sends a barrier request to the SHARP aggregation node.
Help
$ sharp_hello -h
usage: sharp_hello <-d | --ib_dev> <device> [OPTIONS]
OPTIONS:
[-d | --ib_dev] - HCA to use
[-v | --verbose] - libsharp coll verbosity level (default: 2)
Levels: (0-fatal 1-err 2-warn 3-info 4-debug 5-trace)
[-V | --version] - print program version
[-h | --help] - show this usage
Example #1
$ sharp_hello -d mlx5_0:1 -v 3
[thor001:0:15042 - context.c:581] INFO job (ID: 12159720107860141553) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[thor001:0:15042 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[thor001:0:15042 - comm.c:393] INFO [group#:0] group id:a tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0x3f020000000a) mlid:c007
Test Passed.
Example #2
$ SHARP_COLL_ENABLE_SAT=1 sharp_hello -d mlx5_0:1 -v 3
[swx-dgx01:0:59023 - context.c:581] INFO job (ID: 15134963379905498623) resource request quota: ( osts:0 user_data_per_ost:0 max_groups:0 max_qps:1 max_group_channels:1, num_trees:1)
[swx-dgx01:0:59023 - context.c:751] INFO tree_info: type:LLT tree idx:0 treeID:0x0 caps:0x6 quota: ( osts:167 user_data_per_ost:1024 max_groups:167 max_qps:1 max_group_channels:1)
[swx-dgx01:0:59023 - context.c:755] INFO tree_info: type:SAT tree idx:1 treeID:0x3f caps:0x16
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:0] group id:3c tree idx:0 tree_type:LLT rail_idx:0 group size:1 quota: (osts:2 user_data_per_ost:1024) mgid: (subnet prefix:0xff12a01bfe800000 interface id:0xd6060000003c) mlid:c004
[swx-dgx01:0:59023 - comm.c:393] INFO [group#:1] group id:3c tree idx:1 tree_type:SAT rail_idx:0 group size:1 quota: (osts:64 user_data_per_ost:0) mgid: (subnet prefix:0x0 interface id:0x0) mlid:0
Test Passed
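When validating many compute nodes, it can be convenient to grep the utility's output for the success line rather than reading logs by hand. The wrapper function and the stand-in echo command below are illustrative assumptions, not part of the SHARP distribution; the sketch assumes sharp_hello prints "Test Passed" on success, as in the examples above.

```shell
# Sketch: automated pass/fail detection for sharp_hello-style output.
check_sharp() {
    # Runs the given command and scans its combined output
    # for the "Test Passed" success line.
    if "$@" 2>&1 | grep -q 'Test Passed'; then
        echo "OK"
    else
        echo "FAIL"
    fi
}

# Illustration with a stand-in command; on a compute node, replace with:
#   check_sharp sharp_hello -d mlx5_0:1 -v 3
check_sharp echo "Test Passed."   # → OK
```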