Operating NVIDIA SHARP in Dynamic Trees Allocation Mode
A SHARP tree defines a set of switches and their connected links to be used by one or more SHARP jobs.
A single tree can be used by multiple jobs, as long as they use different parts of the tree.
A single job can also use multiple trees: when a job operates on multiple rails, each rail can use a different tree.
By default, sharp_am operates in Static trees mode. In this mode, SHARP trees are created in the sharp_am initialization phase. When a new SHARP job starts, it is assigned to one of the existing SHARP trees that is available to operate the job.
When sharp_am operates in Dynamic trees mode, trees are not created in the initialization phase. Instead, they are created per job, immediately assigned to the job that requires them, and are deleted once the job ends.
The Dynamic trees mode has some benefits over the Static trees mode: it defines the SHARP configuration on the switches only when necessary, and it enables better utilization of fabric resources. There are various scenarios in which Static mode may respond with “No resources” to a SHARP job request, while in Dynamic mode the same request would be fulfilled.
To operate in Dynamic trees mode, set the dynamic_tree_allocation parameter to TRUE.
If the number of root switches in the fabric is larger than 126, it is recommended to set max_trees_to_build to the number of root switches.
A sharp_am restart is required for the configuration to take effect.
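The two settings above can be summarized in a small helper. This is a minimal sketch rather than part of SHARP: the function name dynamic_tree_settings is hypothetical, and the threshold of 126 is taken from the text above. The helper only computes which parameters to change. Writing them to the sharp_am configuration and restarting sharp_am remain manual steps.

```python
from typing import Dict

# Threshold quoted in the text: above it, max_trees_to_build should be raised.
ROOT_SWITCH_THRESHOLD = 126


def dynamic_tree_settings(num_root_switches: int) -> Dict[str, str]:
    """Return the sharp_am parameters to change for Dynamic trees mode.

    Hypothetical helper for illustration only; it does not read or write
    any sharp_am configuration file.
    """
    if num_root_switches < 1:
        raise ValueError("num_root_switches must be a positive integer")

    settings = {"dynamic_tree_allocation": "TRUE"}
    # Override max_trees_to_build only when the fabric has more than 126 roots.
    if num_root_switches > ROOT_SWITCH_THRESHOLD:
        settings["max_trees_to_build"] = str(num_root_switches)
    return settings
```

For a fabric with 200 root switches, the helper returns both parameters. For 126 or fewer, it returns only dynamic_tree_allocation.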
Dynamic trees allocation mode is currently available for fat-tree topology only, and is not supported for dragonfly or hypercube topologies. If sharp_am is configured to operate in Dynamic mode on a topology that is not supported, it automatically operates in Static mode.
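The fallback rule can be expressed as a short decision function. This is an illustrative sketch only: the function name and the topology strings are assumptions made for this example, not values read from or accepted by sharp_am.

```python
# Topology labels are illustrative; they are not sharp_am configuration values.
DYNAMIC_CAPABLE_TOPOLOGIES = {"fat-tree"}


def effective_tree_mode(dynamic_requested: bool, topology: str) -> str:
    """Return the tree allocation mode that would actually be used.

    Dynamic mode applies only when it is requested and the topology
    supports it; in every other case the result is Static mode.
    """
    if dynamic_requested and topology in DYNAMIC_CAPABLE_TOPOLOGIES:
        return "dynamic"
    return "static"
```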
When operating in Dynamic trees mode, ibdiagnet may print warning messages about the existence of multiple distinct trees with the same tree ID. In Dynamic trees mode, this is a valid situation and these warnings should be ignored.
Warning example: -W- <> - In Node <> found root tree (parent qpn <>) which is already exists for treeID: <>
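When post-processing ibdiagnet output on a fabric that runs in Dynamic trees mode, these warnings can be filtered out so that other warnings remain visible. The sketch below is an assumption-laden example: the regular expression is derived only from the warning example above, the function name is hypothetical, and the actual ibdiagnet output may differ in spacing or wording.

```python
import re
from typing import Iterable, List

# Pattern built from the warning example in the text; the "<>" placeholders
# are matched loosely because their exact format is not specified here.
DUPLICATE_TREE_WARNING = re.compile(
    r"^\s*-W-.*found root tree \(parent qpn .*\) "
    r"which is already exists for treeID:"
)


def drop_duplicate_tree_warnings(lines: Iterable[str]) -> List[str]:
    """Return the input lines without duplicate-tree-ID warnings."""
    return [line for line in lines if not DUPLICATE_TREE_WARNING.search(line)]
```

Apply such a filter only when sharp_am is known to be operating in Dynamic trees mode. The text above describes the warning as valid only in that mode.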
Dynamic tree creation does not support the case in which all root switches go down and are restarted. If this happens, restart sharp_am once the root switches are up and running.