Operating NVIDIA SHARP in Dynamic Trees Allocation Mode

A SHARP tree defines a set of switches and their connected links to be used by one or more SHARP jobs.

  • A single tree can be used by multiple jobs, as long as they are using different areas of the tree.

  • A single job can also use multiple trees: when a job operates on multiple rails, each rail can use a different tree.

By default, sharp_am operates in Static trees mode. In this mode, SHARP trees are created in the sharp_am initialization phase. When a new SHARP job starts, it is assigned to one of the existing SHARP trees that is available to operate the job.

When sharp_am operates in Dynamic trees mode, trees are not created in the initialization phase. Instead, they are created per job, immediately assigned to the job that requires them, and are deleted once the job ends.
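The difference in tree lifecycle between the two modes can be pictured with a small toy model. This is only an illustrative sketch, not how sharp_am is implemented: the class names StaticModel and DynamicModel are hypothetical, and the model treats a tree as serving one job at a time, which is stricter than the tree sharing described above.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass
class Tree:
    tree_id: int
    job: Optional[str] = None  # job currently using this tree, if any


class StaticModel:
    """Toy model: all trees are built once, at initialization."""

    def __init__(self, num_trees: int) -> None:
        self.trees: List[Tree] = [Tree(i) for i in range(num_trees)]

    def start_job(self, job: str) -> Optional[Tree]:
        # Assign the job to an existing tree that is free in this toy model.
        for tree in self.trees:
            if tree.job is None:
                tree.job = job
                return tree
        return None  # no existing tree is free

    def end_job(self, job: str) -> None:
        # The tree is released but stays configured.
        for tree in self.trees:
            if tree.job == job:
                tree.job = None


class DynamicModel:
    """Toy model: a tree is created per job and deleted when the job ends."""

    def __init__(self) -> None:
        self.trees: Dict[str, Tree] = {}
        self._ids = count()

    def start_job(self, job: str) -> Tree:
        tree = Tree(next(self._ids), job)
        self.trees[job] = tree
        return tree

    def end_job(self, job: str) -> None:
        self.trees.pop(job, None)


static, dynamic = StaticModel(num_trees=2), DynamicModel()
print(len(static.trees), len(dynamic.trees))  # before any job: 2 0
static.start_job("job-a")
dynamic.start_job("job-a")
print(len(static.trees), len(dynamic.trees))  # while the job runs: 2 1
static.end_job("job-a")
dynamic.end_job("job-a")
print(len(static.trees), len(dynamic.trees))  # after the job ends: 2 0
```

In the toy model, the static pool keeps its two trees for the whole run, while the dynamic model holds a tree only for the lifetime of the job.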

The Dynamic trees mode has some benefits over the Static trees mode: it applies the SHARP configuration to the switches only when necessary, and it enables better utilization of fabric resources. In various scenarios, Static mode may respond with “No resources” to a SHARP job request that Dynamic mode would fulfill.

sharp_am provides two different algorithms that can be used to determine how trees should be created for each job. One algorithm is optimal for SuperPOD fabrics, while the other is optimal for Quasi Fat Trees (QFTs).

Note the following:

  • Only one algorithm can be used at a given time.

  • sharp_am must be restarted when switching algorithms.

It is recommended to consult with NVIDIA experts regarding the suitable algorithm for your system. You can contact us through either of the following methods:

E-mail: Enterprisesupport@nvidia.com

Enterprise Support page: https://www.nvidia.com/en-us/support/enterprise

To operate in Dynamic trees mode, set the dynamic_tree_allocation parameter to TRUE.

By default, the SuperPOD-oriented algorithm is used. To switch to the QFT-oriented algorithm, use the dynamic_tree_algorithm parameter.

If the fabric has more than 126 root switches and the SuperPOD-oriented algorithm is in use, it is recommended to set max_trees_to_build to the number of root switches.

Note that a sharp_am restart is required for the configuration to take effect.
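The parameter choices above can be summarized in a short helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the "name value" line layout is assumed for illustration and should be checked against your sharp_am configuration file. The dynamic_tree_algorithm parameter is left out because its accepted values are not listed here.

```python
from typing import Dict

# Threshold taken from the text above; it applies to the SuperPOD-oriented algorithm.
ROOT_SWITCH_THRESHOLD = 126


def dynamic_tree_settings(num_root_switches: int,
                          superpod_algorithm: bool = True) -> Dict[str, str]:
    """Return the parameters to set for Dynamic trees mode (hypothetical helper)."""
    settings = {"dynamic_tree_allocation": "TRUE"}
    if superpod_algorithm and num_root_switches > ROOT_SWITCH_THRESHOLD:
        # Recommended: one tree per root switch on large fabrics.
        settings["max_trees_to_build"] = str(num_root_switches)
    return settings


def render(settings: Dict[str, str]) -> str:
    """Render settings as 'name value' lines (assumed layout, not verified)."""
    return "\n".join(f"{name} {value}" for name, value in settings.items())


print(render(dynamic_tree_settings(num_root_switches=200)))
```

For a fabric with 126 root switches or fewer, the helper leaves max_trees_to_build untouched, so the existing value stays in effect.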

  • Dynamic trees allocation mode is currently available for fat-tree and Quasi-Fat-Tree (QFT) topologies only, and is not supported for Dragonfly or hypercube topologies. If sharp_am is configured to operate in Dynamic mode on an unsupported topology, it automatically operates in Static mode.

  • When operating in Dynamic trees mode, ibdiagnet may print warning messages about the existence of multiple distinct trees with the same tree ID. In Dynamic trees mode, this is a valid situation and these warnings should be ignored.
    Warning example: -W- <> - In Node <> found root tree (parent qpn <>) which is already exists for treeID: <>
    Note: You can avoid this warning by adding the following parameters to the ibdiagnet command line: --sharp_opt ad_hoc

  • Dynamic trees creation does not support a case in which all root switches go down and are restarted. If such a scenario takes place, sharp_am should be restarted once the root switches are up and running.
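For the ibdiagnet warning described in the list above, saved ibdiagnet output can also be post-processed to hide the duplicate tree ID messages. The sketch below is an assumption-based example: the regular expression is derived only from the warning example shown above, the function name is hypothetical, and the sample lines use made-up values.

```python
import re
from typing import Iterable, List

# Pattern derived from the warning example above; the real message text may vary.
DUPLICATE_TREE_WARNING = re.compile(
    r"^-W-.*found root tree .*which is already exists for treeID"
)


def drop_duplicate_tree_warnings(lines: Iterable[str]) -> List[str]:
    """Return the lines that are not duplicate tree ID warnings."""
    return [line for line in lines if not DUPLICATE_TREE_WARNING.search(line)]


# Made-up sample lines, for illustration only.
sample = [
    "-W- sw1 - In Node sw1 found root tree (parent qpn 18) which is already exists for treeID: 3",
    "-W- sw2 - some other warning",
    "-I- informational line",
]
for line in drop_duplicate_tree_warnings(sample):
    print(line)
```

Filtering after the fact keeps every other warning visible, which matters because only this specific message is expected in Dynamic trees mode.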

© Copyright 2023, NVIDIA. Last updated on May 24, 2023.