NVLink Partition Management#

NVLink Partitioning allows an NVLink rack-scale system to be divided into multiple hardware isolation domains.

In other words, you can isolate compute trays from each other at the NVLink level.

For example, half the compute trays can be in one NVLink partition and half the compute trays in another (within the same rack). Compute trays in one NVLink partition will not be able to establish Multi-Node NVLink connections with compute trays in a different NVLink partition.

The primary use case for doing this is multi-tenancy, specifically in environments with security requirements that mandate tenants of a shared resource be isolated from each other.

  • NVLink partitioning is one piece of the puzzle to fulfill multi-tenancy requirements. It does not solve all aspects.

  • NVLink partitioning does not solve the isolation problem for other non-NVLink networks or storage.

Note

  • It is advised to make sure NVLink Partitioning is a necessary capability to fulfill your environment's isolation requirements.

  • It is possible you may not need this level of isolation. Often, the isolation provided by the workload scheduler and IMEX domains is sufficient.

Please consult with your NVIDIA contact if you have questions.

NVLink Partitioning has the following properties:

  • Ensures that data paths are isolated in each partition.

  • Each NVLink is allocated to one tenant, which ensures exclusive access.

  • Isolation is configured and managed by the NVSwitch Control Plane.

  • NVLinks are not shared across different partitions.

  • Partitions can be created with an arbitrary set of nodes in the NVLink domain.

  • Partitions cannot span NVLink domains.

  • A GPU belongs to only one partition at a time.

  • Partitions can exist without any GPUs allocated to them.

The complete documentation set covering these concepts is available on the NVIDIA Docs Hub.

Adapting NVLink Partitions#

Note

WARNING: Altering NVLink Partitions is disruptive to workloads. Be sure to set aside a maintenance window and pause user workloads when doing this in a production environment.

Creating and managing NVLink Partitions can be handled through different mechanisms.

  • This document covers the process of going directly to the NVSwitch and managing NVLink partitions through the NVIDIA Operating System (NVOS) commands available on the switch.

  • This document will not cover other methods available through APIs.

Note

  • The NVIDIA Mission Control Software will provide tighter integration with NVLink Partition Management APIs in a future release.

  • There are currently APIs available for third-party integration or customer driven integration.

Please consult with your NVIDIA contact if you have further questions.

Multi-Cast Support#

Before we begin, it is important to discuss multi-cast support.

  • GB200/GB300 systems support up to 1,024 multicast teams.

  • This 1,024 value is a global resource shared across all partitions.

Note

The Default Partition is allocated all 1,024 multicast teams. To create new partitions with multicast teams, you must first delete the default partition.

This value can be seen by looking at a partition's information from the NVOS CLI:

root@a02-p01-nvsw-01:~# nv show sdn partition 32766
                operational
---------------  ------------------
name             Default Partition
num-gpus         72
health           healthy
resiliency-mode  adaptive_bandwidth
mcast-limit      1024
partition-type   gpuuid_based

The number of teams to reserve for a partition is determined by the expected usage of the partition's tenant.

  • Smaller partitions should be allocated fewer multicast groups than larger partitions.

  • The performance benefit of multicast teams is limited for smaller partitions.

  • Partitions with four or fewer GPUs will not benefit from multicast and can handle all traffic through unicast.

  • This multicast value must be 0 or a multiple of 4 (otherwise the partitioning request will fail).

Here is a simple allocation algorithm for each partition:

floor(3.55*NGPUs) * 4

In this case, the allocation increases proportionally with the number of GPUs in the partition.

  • A 72-GPU partition will get 1024 multicast teams

  • A 68-GPU partition will get 964 multicast teams

  • A 64-GPU partition will get 908 multicast teams
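
A quick way to apply this rule from any shell is with awk. This is only a convenience sketch (the NGPUS variable and the 68-GPU value are illustrative, taken from the example list above):

# Suggested mcast-limit using the rule above: floor(3.55 * NGPUS) * 4
NGPUS=68
awk -v n="$NGPUS" 'BEGIN { printf "suggested mcast-limit: %d\n", int(3.55 * n) * 4 }'
# prints: suggested mcast-limit: 964

Because the result is always a multiple of 4, it also satisfies the multiple-of-4 constraint noted above.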

NVLink Partition Creation Example#

  1. SSH to the leader NVSwitch within the target NVL72 rack. This is typically the first switch in the rack. It will be running the NMX-C and NMX-T services:

root@a02-p01-nvsw-01:~# nv show cluster apps
Name            ID             Version  Capabilities            Components Version                              Status  Reason  Additional Information          Summary
--------------  -------------  -------  ----------------------  ----------------------------------------------  ------  ------  ------------------------------  -------
nmx-controller  nmx-c-nvos     3.0.216  sm, gfm, fib, gw-api    sm:2025.06.5, gfm:580.82.11, fib:3.0.216        ok              CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry   nmx-telemetry  3.0.12   nvl, gnmi, syslog, bmc  nvl:1.22.3, aggr:1.19.6, cm:1.1.10, gnmi:1.3.4  ok
  2. After confirming the health of NMX-C and NMX-T on the leader NVSwitch, we can take a look at the default partition. You should see all the currently healthy GPUs listed (a maximum of 72 for an NVL72 system):

root@a02-p01-nvsw-01:~# nv show sdn partition 32766
                operational
---------------  ------------------
name             Default Partition
num-gpus         72
health           healthy
resiliency-mode  adaptive_bandwidth
mcast-limit      1024
partition-type   gpuuid_based



locations
============
GPU Location  UUID
------------  --------------------
1.1.1.1       8357015112388089928
1.1.1.2       3611653655934465341
1.1.1.3       5697956643028084298
1.1.1.4       13132800523870289992
1.2.1.1       5518545099473753111
1.2.1.2       7272281881278735529
1.2.1.3       6013361446604380921
1.2.1.4       6312466023165173772
1.3.1.1       8712847454219224873
1.3.1.2       17981856043131365986
1.3.1.3       11595191787482933007
1.3.1.4       9244802223950335906
1.4.1.1       9505350209225944298
1.4.1.2       6907950936280763675
1.4.1.3       17048535318069100821
1.4.1.4       9796042342979276058
1.5.1.1       11893553907981959763
1.5.1.2       12758730871500842809
1.5.1.3       18075172690483246066
1.5.1.4       18141154896230839043
1.6.1.1       5754401769387435280
1.6.1.2       14519491321610656308
1.6.1.3       3191433981021167721
1.6.1.4       8520759092911582246
1.7.1.1       1802959570104224269
1.7.1.2       4308329049834232013
1.7.1.3       4161984048399566877
1.7.1.4       1905967062469685968
1.8.1.1       1814191112675638386
1.8.1.2       874543664049122058
1.8.1.3       1192667468722111650
1.8.1.4       3318734222720160942
1.18.1.1      1585338091237138071
1.18.1.2      5261427084183044488
1.18.1.3      15857209134130120733
1.18.1.4      16702437360055396986
1.19.1.1      16553123489437878365
1.19.1.2      10867913026984328954
1.19.1.3      14601717362032341349
1.19.1.4      345983471647337443
1.20.1.1      17715758010797999246
1.20.1.2      7398104818136756066
1.20.1.3      11762310800770940421
1.20.1.4      13426417461441282974
1.21.1.1      2973714617282940841
1.21.1.2      16036427814672230667
1.21.1.3      3586420395966251791
1.21.1.4      8906897475437660068
1.22.1.1      16957046758513028024
1.22.1.2      3280146108569681629
1.22.1.3      9770830614183995909
1.22.1.4      14085592460229543474
1.23.1.1      12066986062655997037
1.23.1.2      1566773765412201920
1.23.1.3      11540823515532918308
1.23.1.4      6558807212641707636
1.24.1.1      6790686493896794832
1.24.1.2      3033817354205489959
1.24.1.3      17652298479725141362
1.24.1.4      1197963356616139935
1.25.1.1      3951046474381450256
1.25.1.2      8967467691473507419
1.25.1.3      11063919093991180836
1.25.1.4      9936424425473242695
1.26.1.1      17358992763701212968
1.26.1.2      16657178430542214144
1.26.1.3      1175386945681850669
1.26.1.4      13512042552851862350
1.27.1.1      15679035461973358023
1.27.1.2      14104031807953940632
1.27.1.3      6727688793289848219
1.27.1.4      10169156919294575395
  3. We can additionally look on the BCM side to confirm it sees the same information (output is shortened):

[a03-u02-bcm-01->device]% nvplatforminfo -r a02
Node                GPU      IB GUID             Serial                                Slot     Tray     Host ID  Peer type Module ID
------------------- -------- ------------------- ------------------------------------- -------- -------- -------- --------- ---------
a02-p01-dgx-01-c01  0        0x321f2bdd5db6f53d  31383233-3332-3532-3034-303036000000  1        0        1        1         2
a02-p01-dgx-01-c01  1        0x73fa14f7a4769848  31383233-3332-3532-3034-303036000000  1        0        1        1         1
a02-p01-dgx-01-c01  2        0xb641147ef35a4048  31383233-3332-3532-3034-303036000000  1        0        1        1         4
a02-p01-dgx-01-c01  3        0x4f133560b544264a  31383233-3332-3532-3034-303036000000  1        0        1        1         3

Note

The output above from BCM, using the IB GUID field, matches these GPUs from the NVSwitch output:

GPU Location  UUID
------------  --------------------
1.1.1.1       8357015112388089928
1.1.1.2       3611653655934465341
1.1.1.3       5697956643028084298
1.1.1.4       13132800523870289992

The NVSwitch displays these values in decimal, while BCM displays them in hexadecimal. Convert the UUIDs above to hexadecimal and you will see the match.
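
For example, converting the first UUID above with the shell's printf reproduces one of the IB GUIDs that BCM reports for a02-p01-dgx-01-c01:

printf '%x\n' 8357015112388089928
# prints 73fa14f7a4769848, i.e. IB GUID 0x73fa14f7a4769848 (GPU 1 in the BCM output above)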

  4. Now that we are oriented on how GPU IDs are mapped between BCM and the NVSwitch, let's delete the default partition so we can then create two new partitions (again, this is done by running NVOS commands directly on the leader NVSwitch).

root@a02-p01-nvsw-01:~# nv action delete sdn  partition 32766
Action executing ...
Deleting a partition: 32766
Action executing ...
Partition 32766 is successfully deleted
Action succeeded
  5. At this point, there are no partitions. We will create two new ones so the NVL72 rack is split into two NVLink partitions.

root@a02-p01-nvsw-01:~# nv action create sdn partition 1 name part1 resiliency-mode adaptive_bandwidth mcast-limit 512
Action executing ...
Partition 1 is successfully created
Action succeeded
root@a02-p01-nvsw-01:~# nv action create sdn partition 2 name part2 resiliency-mode adaptive_bandwidth mcast-limit 512
Action executing ...
Partition 2 is successfully created
Action succeeded
root@a02-p01-nvsw-01:~# nv show sdn partition
ID  Name   Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type  Summary
--  -----  -----------  -------  ------------------  ----------------------  --------------  -------
1   part1  0            healthy  adaptive_bandwidth  512
2   part2  0            healthy  adaptive_bandwidth  512

Note

  • Now we have two NVLink partitions with zero GPUs each. At this point some scripting helps.

  • You will want to decide which GPUs you want in each NVLink partition and loop through commands like these (a loop sketch follows the single-GPU example below):

nv action update sdn partition "$PARTITION_ID" uuid "$UUID"
nv action update sdn partition "$PARTITION_ID" location "$LOCATION"
nv action restore sdn partition "$PARTITION_ID" uuid "$UUID"
nv action restore sdn partition "$PARTITION_ID" location "$LOCATION"

Below is an example of the command for adding a GPU to partition 1.

  • Please refer to the GPU mapping in Step 2 to understand how UUID or location maps to a node’s GPU in BCM.

root@a02-p01-nvsw-01:~# nv action update sdn partition 1 uuid 8357015112388089928
Action executing ...
Updating uuid 8357015112388089928 in partition 1
Action executing ...
Partition 1 uuid 8357015112388089928 has been successfully updated
Action succeeded
root@a02-p01-nvsw-01:~# nv action update sdn partition 1 location 1.1.1.1
Action executing ...
Updating location 1.1.1.1 in partition 1
Action executing ...
Partition 1 location 1.1.1.1 has been successfully updated
Action succeeded
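
Rather than running this command 72 times by hand, a small shell loop over your chosen UUIDs can do the same work. This is only a sketch; the PARTITION_ID and the two UUIDs below are illustrative values taken from the default-partition listing in Step 2:

PARTITION_ID=1
# UUIDs of the GPUs that should land in this partition (illustrative values)
for UUID in 8357015112388089928 3611653655934465341; do
    nv action update sdn partition "$PARTITION_ID" uuid "$UUID"
done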

In this example, after repeating the steps above 72 times (once for each GPU in the rack), the rack is divided into two separate NVLink partitions. The output below shows that each partition contains 36 GPUs.

root@a02-p01-nvsw-01:~# nv show sdn partition
ID  Name   Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type  Summary
--  -----  -----------  -------  ------------------  ----------------------  --------------  -------
1   part1  36           healthy  adaptive_bandwidth  512                     gpuuid_based
2   part2  36           healthy  adaptive_bandwidth  512                     gpuuid_based
root@a02-p01-nvsw-01:~# nv show sdn partition
ID  Name   Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type    Summary
--  -----  -----------  -------  ------------------  ----------------------  ----------------  -------
1   part1  36           healthy  adaptive_bandwidth  512                     location_based
2   part2  36           healthy  adaptive_bandwidth  512                     location_based
  6. Check that BCM is seeing the NVLink Partition changes.

[a03-u02-bcm-01->device]% nvdomaininfo -r a02
Node                GPU      Status   Cluster UUID                           Clique ID
------------------- -------- -------- -------------------------------------- ------------
a02-p01-dgx-01-c01  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811970
a02-p01-dgx-01-c01  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811969
a02-p01-dgx-01-c01  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811972
a02-p01-dgx-01-c01  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811971
a02-p01-dgx-01-c02  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766

As another check, we can also look on the compute tray directly and match this to what it sees:

root@a02-p01-dgx-01-c01:~# nvidia-smi -q | grep Fabric -A 3
        GPU Fabric GUID                   : 0x321f2bdd5db6f53d
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811970
--
        GPU Fabric GUID                   : 0x73fa14f7a4769848
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811969
--
        GPU Fabric GUID                   : 0xb641147ef35a4048
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811972
--
        GPU Fabric GUID                   : 0x4f133560b544264a
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811971

Note

WARNING

  • What we see above is that the Clique IDs for the GPUs in node a02-p01-dgx-01-c01 have changed. However, they do not yet match the partition IDs we created (either "1" or "2").

  • We need to reset the GPUs (or reboot the nodes) in order for the Clique ID to update.

  7. Reset the GPUs (or reboot the compute trays) once all the partitioning changes are done on the NVSwitch side.

root@a02-p01-dgx-01-c11:~# systemctl stop cmd nvidia-persistenced.service nvidia-imex.service nvsm.service nvidia-dcgm slurmd nvidia-dcgm-exporter.service dcgm-exporter
Failed to stop dcgm-exporter.service: Unit dcgm-exporter.service not loaded.
root@a02-p01-dgx-01-c11:~# nvidia-smi --gpu-reset
GPU 00000008:06:00.0 was successfully reset.
GPU 00000009:06:00.0 was successfully reset.
GPU 00000018:06:00.0 was successfully reset.
GPU 00000019:06:00.0 was successfully reset.
All done.

Note

The above example only handles a single compute tray's GPUs. You can use some scripting or the 'pdsh' tool from the BCM head node to target multiple nodes at once, for example:
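
Below is a sketch of such a pdsh invocation run from the BCM head node. The host range targets every compute tray in rack a02 and the service list mirrors the single-node example above; adjust both for your environment:

# Stop the GPU-related services on all 18 compute trays in rack a02, then reset the GPUs
pdsh -w a02-p01-dgx-01-c[01-18] \
  'systemctl stop cmd nvidia-persistenced.service nvidia-imex.service nvsm.service nvidia-dcgm slurmd nvidia-dcgm-exporter.service; nvidia-smi --gpu-reset'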

  8. Re-check BCM's view of the NVLink partitions after nodes have had their GPUs reset or rebooted.

[a03-u02-bcm-01->device]% nvdomaininfo -r a02
Node                GPU      Status   Cluster UUID                           Clique ID
------------------- -------- -------- -------------------------------------- ------------
a02-p01-dgx-01-c01  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c03  0        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c03  1        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c03  2        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c03  3        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c04  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c04  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c04  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c04  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c10  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c10  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c10  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c10  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2

This concludes all the steps necessary to create new NVLink partitions through the leader NVSwitch using the NVOS CLI.

Batch GPU Operations via gRPC#

The NVOS CLI updates partitions one GPU at a time. For batch operations, you can use gRPC calls to add or remove multiple GPUs in a single request. All gRPC commands must be run on the NVSwitch.

Prerequisites: Install grpcurl on the NVSwitch:

sudo apt update && sudo apt install wget -y
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.2/grpcurl_1.9.2_linux_amd64.deb
sudo apt install ./grpcurl_1.9.2_linux_amd64.deb
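
If the NMX-C gRPC endpoint exposes server reflection (an assumption, not something this guide confirms), you can verify that grpcurl can reach the service before making any changes:

grpcurl -plaintext 127.0.0.1:9371 list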

Add GPUs to a Partition - Examples

grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "gpuUid": ["8357015112388089928", "3611653655934465341"],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/AddGpusToPartition
grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "locationList": [
        {"loc": {"chassisId": 1, "slotId": 2, "hostId": 1}, "gpuId": 1},
        {"loc": {"chassisId": 1, "slotId": 2, "hostId": 1}, "gpuId": 2}
    ],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/AddGpusToPartition

Remove GPUs from a Partition - Examples

grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "gpuUid": ["8357015112388089928", "3611653655934465341"],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/RemoveGpusFromPartition
grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "locationList": [
        {"loc": {"chassisId": 1, "slotId": 1, "hostId": 1}, "gpuId": 1},
        {"loc": {"chassisId": 1, "slotId": 1, "hostId": 1}, "gpuId": 2}
    ],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/RemoveGpusFromPartition

Maintenance Flows for Compute Trays#

If a compute tray is exhibiting faulty behavior, indicating a hardware problem that needs physical attention, a defined maintenance flow needs to be followed in order to isolate the compute tray from a working NVLink partition.

  • There are different workflows for different NVLink partition types.

  • In the case of an NMC environment, we utilize UUID-based (gpuuid_based) partitions.

An overview of the maintenance workflow looks like this:

  1. Create an Out For Repair (OFR) partition with numgpus=0 (see the command sketch after this list).

  2. When a faulty compute tray needs to be isolated, remove the tray from the running partition.

  3. Add the tray to the OFR partition.

  4. (Optional) Power off the compute tray and remove it from the slot.

  5. (Optional) Insert the tray back into the same slot and power on the tray.

  6. Once maintenance is complete, proceed with the following tasks:
    • Power off the compute tray in the OFR partition.

    • Remove the OFR partition.

    • Add tray back to the running partition.
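
As an illustration of step 1, an empty OFR partition can be created with the same NVOS command shown earlier in this document; the partition ID, name, and mcast-limit of 0 below are only example values:

nv action create sdn partition 3 name ofr resiliency-mode adaptive_bandwidth mcast-limit 0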

This is a high-level overview. Please consult the Partition Guides available here for detailed information.

Workload Management and Job Scheduling Considerations#

When adapting NVLink partitions it is important to consider how this impacts the workload management systems and job scheduling. In this document we will cover some basics.

  • How it specifically impacts workload management systems and job scheduling will depend upon the specific environment and multi-tenancy stance.

  • NMC supports two workload management systems, Slurm and Run:ai, so this document focuses on those.

Please consult your NVIDIA contact for more information.

Background#

When providing NVIDIA Multi-Node NVLink capable systems (i.e., NVL72 GB200/GB300 rack-scale systems) as shared resources to users through Slurm and Run:ai, it is vital to understand NVLink Domains and Partitions.

  • The goal is to make the workload management system aware of the NVLink Partitions so jobs can be scheduled onto nodes within the cluster, across racks, appropriately.

  • The main point is scheduling jobs within Multi-Node NVLink capable blocks (i.e. NVLink Partitions). This is where the highest performance is guaranteed.

  • These optimized blocks of nodes could be the full rack, if the default NVLink partition is used, or certain nodes in a rack confined to their own NVLink partition.

This NVLink topology-aware scheduling should be mostly transparent to the users.

  • What we discuss here is mostly for administrators of the NVL72 system.

  • The specifics of how to do this for your environment will depend upon your NVLink Partition configuration and your Slurm/Run:ai configuration.

Slurm#

Within Slurm, the topology/block plugin was developed in order to help job scheduling on NVIDIA’s Multi-Node NVLink capable rack-scale systems. With this plugin we are able to build a topology out of blocks of nodes.

  • A block is a logical grouping of nodes that reside within the same NVLink Partition.

  • Blocks are non-overlapping: a node can only belong to one block.

  • The block size is the unit you define, which should align with NVLink partitions. When using the default partition (the full rack), the block size is 18 (all nodes in a rack).

The path to configuring these blocks within NMC is through Base Command Manager here:

[a03-u02-bcm-01->wlm[slurm]->topologysettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Revision
Topology plugin                  Block
Topology source                  Internal
Parameters                       <submode>
Tree settings                    <submode>
Block settings                   <submode>
Topograph settings               <submode>
[a03-u02-bcm-01->wlm[slurm]->topologysettings->blocksettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Block sizes
Revision
Block entity                     Rack
Allowed racks
Allowed nodegroups

Here is an example of the topology.conf file when the Block entity is “Rack”:

root@a03-u02-bcm-01:~# cat /cm/shared/apps/slurm/etc/slurm/topology.conf
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
BlockName=A02 Nodes=a02-p01-dgx-01-c[01-18]

In the above, we can see that we have a “BlockName” that is associated with a rack (i.e. “A02”).

  • The above configuration assumes that each rack is using the default partition, which is a full rack. It does not account for partitions within a rack. This is the default behavior.

To account for partitions within a rack (that is, when you delete the default partition and create new ones), switch the “Block entity” value to “NodeGroup”.

[a03-u02-bcm-01->wlm[slurm]->topologysettings->blocksettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Block sizes
Revision
Block entity                     NodeGroup
Allowed racks
Allowed nodegroups

We can then configure a new NodeGroup and specific nodes for that NodeGroup:

[a03-u02-bcm-01->nodegroup[ng1]]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Name                             ng1
Revision
Nodes                            a02-p01-dgx-01-c01..a02-p01-dgx-01-c09

And our topology.conf file will reflect those changes:

root@a03-u02-bcm-01:~# cat /cm/shared/apps/slurm/etc/slurm/topology.conf
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
BlockName=ng1 Nodes=a02-p01-dgx-01-c[01-09]
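
For the two-partition example created earlier in this document (trays c01-c09 in NVLink partition 1 and trays c10-c18 in partition 2), adding a second NodeGroup for the remaining trays would yield one block per NVLink partition. The NodeGroup name ng2 below is only an illustration:

BlockName=ng1 Nodes=a02-p01-dgx-01-c[01-09]
BlockName=ng2 Nodes=a02-p01-dgx-01-c[10-18]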

Note

  • Designing proper topology awareness can require expert knowledge about the system. Additionally, it requires clear understanding of the multi-tenancy requirements for the target environment.

  • We did not discuss Slurm Partitions in this context. That is another layer of configuration complexity that should be considered when designing around NVLink Partitions.

There is no one-size-fits-all configuration. Please consult your NVIDIA contact for more information and guidance.

Run:ai#

  • Run:ai automatically detects NVLink Partitions (or Multi-Node NVLink Domains) for us. Some details are below for those interested.

  • Run:ai taps into node labels provided by the NVIDIA GPU Operator; these labels are named “nvidia.com/gpu.clique”.

Recall that we can see the clique in BCM as well:

[a03-u02-bcm-01->device]% nvdomaininfo -r a02
Node                GPU      Status   Cluster UUID                           Clique ID
------------------- -------- -------- -------------------------------------- ------------
a02-p01-dgx-01-c01  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
  • The above shows that nodes a02-p01-dgx-01-c01 and a02-p01-dgx-01-c02 are in different NVLink Partitions.

  • Run:ai discovers this via the “nvidia.com/gpu.clique” label that is applied to the GB200/GB300 compute trays that are allocated as Run:ai GPU worker nodes.

Another way to see the node label is through the “kubectl” command:

root@a03-u02-bcm-01:~# kubectl describe nodes | grep clique
nvidia.com/gpu.clique=a868ae12-3336-47d9-82cc-85febb8ed722.32766
nvidia.com/gpu.clique=a868ae12-3336-47d9-82cc-85febb8ed722.1
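
The clique label can also be listed as a column next to each node name, which makes it easier to see which trays share an NVLink partition:

kubectl get nodes -L nvidia.com/gpu.clique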

Note

What is described above for Run:ai's ability to autodetect Multi-Node NVLink Domains is very basic and showcases the default behavior.

  • Typically the default behavior is sufficient and customizations are not needed, but please consult with your NVIDIA contacts.
