NVLink Partition Management#

NVLink Partitioning allows an NVLink rack-scale system to be divided into multiple hardware isolation domains.

In other words, you can isolate compute trays from each other at the NVLink level.

For example, half the compute trays can be in one NVLink partition and half the compute trays in another (within the same rack). Compute trays in one NVLink partition will not be able to establish Multi-Node NVLink connections with compute trays in a different NVLink partition.

The primary use case for doing this is multi-tenancy, specifically in environments with security requirements that mandate tenants of a shared resource be isolated from each other.

  • NVLink partitioning is one piece of the puzzle to fulfill multi-tenancy requirements. It does not solve all aspects.

  • NVLink partitioning does not solve the isolation problem for other non-NVLink networks or storage.

Note

  • It is advised to make sure NVLink Partitioning is a necessary capability to fulfill your environment's isolation requirements.

  • It is possible you may not need this level of isolation. Often, the isolation provided by the workload scheduler and IMEX domains is sufficient.

Please consult with your NVIDIA contact if you have questions.

NVLink Partitioning has the following properties:

  • Ensures that data paths are isolated in each partition.

  • Each NVLink is allocated to one tenant, which ensures exclusive access.

  • Isolation is configured and managed by the NVSwitch Control Plane.

  • NVLinks are not shared across different partitions.

  • Partitions can be created with an arbitrary set of nodes in the NVLink domain.

  • Partitions cannot span NVLink domains.

  • A GPU belongs to only one partition at a time.

  • Partitions can exist without any GPUs allocated to them.

The complete documentation set covering these concepts is available on the NVIDIA Docs Hub.

Adapting NVLink Partitions#

Note

WARNING: Altering NVLink Partitions is disruptive to workloads. Be sure to set aside a maintenance window and pause user workloads when doing this in a production environment.

Creating and managing NVLink Partitions can be handled through different mechanisms.

  • This document covers the process of going directly to the NVSwitch and managing NVLink partitions through the NVIDIA Operating System (NVOS) commands available on the switch.

  • This document will not cover other methods available through APIs.

Note

  • The NVIDIA Mission Control Software will provide tighter integration with NVLink Partition Management APIs in a future release.

  • There are currently APIs available for third-party integration or customer driven integration.

Please consult with your NVIDIA contact if you have further questions.

Multi-Cast Support#

Before we begin, it is important to discuss multi-cast support.

  • GB200/GB300 systems support up to 1,024 multicast teams.

  • This 1,024 value is a global resource shared across all partitions.

Note

The Default Partition is allocated all 1,024 multicast teams. To create new partitions with multicast teams, you must first delete the default partition.

This value can be seen by looking at a partition's information from the NVOS CLI:

root@a02-p01-nvsw-01:~# nv show sdn partition 32766
                operational
---------------  ------------------
name             Default Partition
num-gpus         72
health           healthy
resiliency-mode  adaptive_bandwidth
mcast-limit      1024
partition-type   gpuuid_based

The number of teams to reserve for a partition is determined by the expected usage of the partition's tenant.

  • Smaller partitions should be allocated fewer multicast groups than larger partitions.

  • The performance benefit of multicast teams is limited for smaller partitions.

  • Partitions with four or fewer GPUs will not benefit from multicast and can handle all traffic through unicast.

  • This multicast value must be 0 or a multiple of 4 (otherwise the partitioning request will fail).

Here is a simple allocation algorithm for each partition:

floor(3.55*NGPUs) * 4

In this case, the allocation increases proportionally with the number of GPUs in the partition.

  • A 72-GPU partition will get 1024 multicast teams

  • A 68-GPU partition will get 964 multicast teams

  • A 64-GPU partition will get 908 multicast teams
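
A quick way to apply this rule from any shell is with awk. This is only a convenience sketch (the NGPUS variable and the 68-GPU value are illustrative, taken from the example list above):

# Suggested mcast-limit using the rule above: floor(3.55 * NGPUS) * 4
NGPUS=68
awk -v n="$NGPUS" 'BEGIN { printf "suggested mcast-limit: %d\n", int(3.55 * n) * 4 }'
# prints: suggested mcast-limit: 964

Because the result is always a multiple of 4, it also satisfies the multiple-of-4 constraint noted above.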

NVLink Partition Creation Example#

  1. SSH to the leader NVSwitch within the target NVL72 rack. This is typically the first switch in the rack. It will be running the NMX-C and NMX-T services:

root@a02-p01-nvsw-01:~# nv show cluster apps
Name            ID             Version  Capabilities            Components Version                              Status  Reason  Additional Information          Summary
--------------  -------------  -------  ----------------------  ----------------------------------------------  ------  ------  ------------------------------  -------
nmx-controller  nmx-c-nvos     3.0.216  sm, gfm, fib, gw-api    sm:2025.06.5, gfm:580.82.11, fib:3.0.216        ok              CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry   nmx-telemetry  3.0.12   nvl, gnmi, syslog, bmc  nvl:1.22.3, aggr:1.19.6, cm:1.1.10, gnmi:1.3.4  ok
  2. After confirming the health of NMX-C and NMX-T on the leader NVSwitch, we can take a look at the default partition. You should see all the currently healthy GPUs listed (a maximum of 72 for an NVL72 system):

root@a02-p01-nvsw-01:~# nv show sdn partition 32766
                operational
---------------  ------------------
name             Default Partition
num-gpus         72
health           healthy
resiliency-mode  adaptive_bandwidth
mcast-limit      1024
partition-type   gpuuid_based



locations
============
GPU Location  UUID
------------  --------------------
1.1.1.1       8357015112388089928
1.1.1.2       3611653655934465341
1.1.1.3       5697956643028084298
1.1.1.4       13132800523870289992
1.2.1.1       5518545099473753111
1.2.1.2       7272281881278735529
1.2.1.3       6013361446604380921
1.2.1.4       6312466023165173772
1.3.1.1       8712847454219224873
1.3.1.2       17981856043131365986
1.3.1.3       11595191787482933007
1.3.1.4       9244802223950335906
1.4.1.1       9505350209225944298
1.4.1.2       6907950936280763675
1.4.1.3       17048535318069100821
1.4.1.4       9796042342979276058
1.5.1.1       11893553907981959763
1.5.1.2       12758730871500842809
1.5.1.3       18075172690483246066
1.5.1.4       18141154896230839043
1.6.1.1       5754401769387435280
1.6.1.2       14519491321610656308
1.6.1.3       3191433981021167721
1.6.1.4       8520759092911582246
1.7.1.1       1802959570104224269
1.7.1.2       4308329049834232013
1.7.1.3       4161984048399566877
1.7.1.4       1905967062469685968
1.8.1.1       1814191112675638386
1.8.1.2       874543664049122058
1.8.1.3       1192667468722111650
1.8.1.4       3318734222720160942
1.18.1.1      1585338091237138071
1.18.1.2      5261427084183044488
1.18.1.3      15857209134130120733
1.18.1.4      16702437360055396986
1.19.1.1      16553123489437878365
1.19.1.2      10867913026984328954
1.19.1.3      14601717362032341349
1.19.1.4      345983471647337443
1.20.1.1      17715758010797999246
1.20.1.2      7398104818136756066
1.20.1.3      11762310800770940421
1.20.1.4      13426417461441282974
1.21.1.1      2973714617282940841
1.21.1.2      16036427814672230667
1.21.1.3      3586420395966251791
1.21.1.4      8906897475437660068
1.22.1.1      16957046758513028024
1.22.1.2      3280146108569681629
1.22.1.3      9770830614183995909
1.22.1.4      14085592460229543474
1.23.1.1      12066986062655997037
1.23.1.2      1566773765412201920
1.23.1.3      11540823515532918308
1.23.1.4      6558807212641707636
1.24.1.1      6790686493896794832
1.24.1.2      3033817354205489959
1.24.1.3      17652298479725141362
1.24.1.4      1197963356616139935
1.25.1.1      3951046474381450256
1.25.1.2      8967467691473507419
1.25.1.3      11063919093991180836
1.25.1.4      9936424425473242695
1.26.1.1      17358992763701212968
1.26.1.2      16657178430542214144
1.26.1.3      1175386945681850669
1.26.1.4      13512042552851862350
1.27.1.1      15679035461973358023
1.27.1.2      14104031807953940632
1.27.1.3      6727688793289848219
1.27.1.4      10169156919294575395
  3. We can additionally look on the BCM side to confirm it sees the same information (output is shortened):

[a03-u02-bcm-01->device]% nvplatforminfo -r a02
Node                GPU      IB GUID             Serial                                Slot     Tray     Host ID  Peer type Module ID
------------------- -------- ------------------- ------------------------------------- -------- -------- -------- --------- ---------
a02-p01-dgx-01-c01  0        0x321f2bdd5db6f53d  31383233-3332-3532-3034-303036000000  1        0        1        1         2
a02-p01-dgx-01-c01  1        0x73fa14f7a4769848  31383233-3332-3532-3034-303036000000  1        0        1        1         1
a02-p01-dgx-01-c01  2        0xb641147ef35a4048  31383233-3332-3532-3034-303036000000  1        0        1        1         4
a02-p01-dgx-01-c01  3        0x4f133560b544264a  31383233-3332-3532-3034-303036000000  1        0        1        1         3

Note

The output above from BCM, using the IB GUID field, matches these GPUs from the NVSwitch output:

GPU Location  UUID
------------  --------------------
1.1.1.1       8357015112388089928
1.1.1.2       3611653655934465341
1.1.1.3       5697956643028084298
1.1.1.4       13132800523870289992

The NVSwitch displays these values in decimal, while BCM displays them in hexadecimal. Convert the UUIDs above to hexadecimal and you will see the match.
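
For example, converting the first UUID above with the shell's printf reproduces one of the IB GUIDs that BCM reports for a02-p01-dgx-01-c01:

printf '%x\n' 8357015112388089928
# prints 73fa14f7a4769848, i.e. IB GUID 0x73fa14f7a4769848 (GPU 1 in the BCM output above)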

  4. Now that we are oriented on how GPU IDs are mapped between BCM and the NVSwitch, let's delete the default partition so we can then create two new partitions (again, this is done by running NVOS commands directly on the leader NVSwitch).

root@a02-p01-nvsw-01:~# nv action delete sdn  partition 32766
Action executing ...
Deleting a partition: 32766
Action executing ...
Partition 32766 is successfully deleted
Action succeeded
  5. At this point, there are no partitions. We will create two new ones so the NVL72 rack is split into two NVLink partitions.

root@a02-p01-nvsw-01:~# nv action create sdn partition 1 name part1 resiliency-mode adaptive_bandwidth mcast-limit 512
Action executing ...
Partition 1 is successfully created
Action succeeded
root@a02-p01-nvsw-01:~# nv action create sdn partition 2 name part2 resiliency-mode adaptive_bandwidth mcast-limit 512
Action executing ...
Partition 2 is successfully created
Action succeeded
root@a02-p01-nvsw-01:~# nv show sdn partition
ID  Name   Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type  Summary
--  -----  -----------  -------  ------------------  ----------------------  --------------  -------
1   part1  0            healthy  adaptive_bandwidth  512
2   part2  0            healthy  adaptive_bandwidth  512

Note

  • Now we have two NVLink partitions with zero GPUs each. At this point some scripting helps.

  • You will want to decide which GPUs you want in each NVLink partition and loop through commands like these (a loop sketch follows the single-GPU example below):

nv action update sdn partition "$PARTITION_ID" uuid "$UUID"
nv action update sdn partition "$PARTITION_ID" location "$LOCATION"
nv action restore sdn partition "$PARTITION_ID" uuid "$UUID"
nv action restore sdn partition "$PARTITION_ID" location "$LOCATION"

Below is an example of the command for adding a GPU to partition 1.

  • Please refer to the GPU mapping in Step 2 to understand how UUID or location maps to a node’s GPU in BCM.

root@a02-p01-nvsw-01:~# nv action update sdn partition 1 uuid 8357015112388089928
Action executing ...
Updating uuid 8357015112388089928 in partition 1
Action executing ...
Partition 1 uuid 8357015112388089928 has been successfully updated
Action succeeded
root@a02-p01-nvsw-01:~# nv action update sdn partition 1 location 1.1.1.1
Action executing ...
Updating location 1.1.1.1 in partition 1
Action executing ...
Partition 1 location 1.1.1.1 has been successfully updated
Action succeeded
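
Rather than running this command 72 times by hand, a small shell loop over your chosen UUIDs can do the same work. This is only a sketch; the PARTITION_ID and the two UUIDs below are illustrative values taken from the default-partition listing in Step 2:

PARTITION_ID=1
# UUIDs of the GPUs that should land in this partition (illustrative values)
for UUID in 8357015112388089928 3611653655934465341; do
    nv action update sdn partition "$PARTITION_ID" uuid "$UUID"
done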

In this example, after repeating the steps above 72 times (once for each GPU in the rack), the rack is divided into two separate NVLink partitions. The output below shows that each partition contains 36 GPUs.

root@a02-p01-nvsw-01:~# nv show sdn partition
ID  Name   Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type  Summary
--  -----  -----------  -------  ------------------  ----------------------  --------------  -------
1   part1  36           healthy  adaptive_bandwidth  512                     gpuuid_based
2   part2  36           healthy  adaptive_bandwidth  512                     gpuuid_based
root@a02-p01-nvsw-01:~# nv show sdn partition
ID  Name   Num of GPUs  Health   Resiliency mode     Multicast groups limit  Partition type    Summary
--  -----  -----------  -------  ------------------  ----------------------  ----------------  -------
1   part1  36           healthy  adaptive_bandwidth  512                     location_based
2   part2  36           healthy  adaptive_bandwidth  512                     location_based
  6. Check that BCM is seeing the NVLink Partition changes.

[a03-u02-bcm-01->device]% nvdomaininfo -r a02
Node                GPU      Status   Cluster UUID                           Clique ID
------------------- -------- -------- -------------------------------------- ------------
a02-p01-dgx-01-c01  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811970
a02-p01-dgx-01-c01  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811969
a02-p01-dgx-01-c01  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811972
a02-p01-dgx-01-c01  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   553811971
a02-p01-dgx-01-c02  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766

As another check, we can also look on the compute tray directly and match this to what it sees:

root@a02-p01-dgx-01-c01:~# nvidia-smi -q | grep Fabric -A 3
        GPU Fabric GUID                   : 0x321f2bdd5db6f53d
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811970
--
        GPU Fabric GUID                   : 0x73fa14f7a4769848
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811969
--
        GPU Fabric GUID                   : 0xb641147ef35a4048
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811972
--
        GPU Fabric GUID                   : 0x4f133560b544264a
Inforom Version
        Image Version                     : G548.0301.00.03
        OEM Object                        : 2.1
--
Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 553811971

Note

WARNING

  • What we see above is that the Clique IDs for the GPUs in node a02-p01-dgx-01-c01 have changed. However, they do not yet match the partition IDs we created (either "1" or "2").

  • We need to reset the GPUs (or reboot the nodes) in order for the Clique ID to update.

  7. Reset the GPUs (or reboot the compute trays) once all the partitioning changes are done on the NVSwitch side.

root@a02-p01-dgx-01-c11:~# systemctl stop cmd nvidia-persistenced.service nvidia-imex.service nvsm.service nvidia-dcgm slurmd nvidia-dcgm-exporter.service dcgm-exporter
Failed to stop dcgm-exporter.service: Unit dcgm-exporter.service not loaded.
root@a02-p01-dgx-01-c11:~# nvidia-smi --gpu-reset
GPU 00000008:06:00.0 was successfully reset.
GPU 00000009:06:00.0 was successfully reset.
GPU 00000018:06:00.0 was successfully reset.
GPU 00000019:06:00.0 was successfully reset.
All done.

Note

The above example only handles a single compute tray's GPUs. You can use some scripting or the 'pdsh' tool from the BCM head node to target multiple nodes at once, for example:
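
Below is a sketch of such a pdsh invocation run from the BCM head node. The host range targets every compute tray in rack a02 and the service list mirrors the single-node example above; adjust both for your environment:

# Stop the GPU-related services on all 18 compute trays in rack a02, then reset the GPUs
pdsh -w a02-p01-dgx-01-c[01-18] \
  'systemctl stop cmd nvidia-persistenced.service nvidia-imex.service nvsm.service nvidia-dcgm slurmd nvidia-dcgm-exporter.service; nvidia-smi --gpu-reset'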

  8. Re-check BCM's view of the NVLink partitions after nodes have had their GPUs reset or rebooted.

[a03-u02-bcm-01->device]% nvdomaininfo -r a02
Node                GPU      Status   Cluster UUID                           Clique ID
------------------- -------- -------- -------------------------------------- ------------
a02-p01-dgx-01-c01  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c03  0        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c03  1        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c03  2        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c03  3        Success  22942071-f941-44bf-bed7-cee938fc6c7d   1
a02-p01-dgx-01-c04  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c04  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c04  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c04  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c05  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c06  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c07  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c08  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c09  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c10  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c10  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c10  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c10  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c11  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c12  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c13  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c14  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c15  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c16  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c17  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2
a02-p01-dgx-01-c18  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   2

This concludes all the steps necessary to create new NVLink partitions through the leader NVSwitch using the NVOS CLI.

Batch GPU Operations via gRPC#

The NVOS CLI updates partitions one GPU at a time. For batch operations, you can use gRPC calls to add or remove multiple GPUs in a single request. All gRPC commands must be run on the NVSwitch.

Prerequisites: Install grpcurl on the NVSwitch:

sudo apt update && sudo apt install wget -y
wget https://github.com/fullstorydev/grpcurl/releases/download/v1.9.2/grpcurl_1.9.2_linux_amd64.deb
sudo apt install ./grpcurl_1.9.2_linux_amd64.deb
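
If the NMX-C gRPC endpoint exposes server reflection (an assumption, not something this guide confirms), you can verify that grpcurl can reach the service before making any changes:

grpcurl -plaintext 127.0.0.1:9371 list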

Add GPUs to a Partition - Examples

grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "gpuUid": ["8357015112388089928", "3611653655934465341"],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/AddGpusToPartition
grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "locationList": [
        {"loc": {"chassisId": 1, "slotId": 2, "hostId": 1}, "gpuId": 1},
        {"loc": {"chassisId": 1, "slotId": 2, "hostId": 1}, "gpuId": 2}
    ],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/AddGpusToPartition

Remove GPUs from a Partition - Examples

grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "gpuUid": ["8357015112388089928", "3611653655934465341"],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/RemoveGpusFromPartition
grpcurl -plaintext -d '{
    "context": {},
    "gatewayId": "myGateway",
    "partitionId": {"partitionId": 1},
    "locationList": [
        {"loc": {"chassisId": 1, "slotId": 1, "hostId": 1}, "gpuId": 1},
        {"loc": {"chassisId": 1, "slotId": 1, "hostId": 1}, "gpuId": 2}
    ],
    "reroute": true
}' 127.0.0.1:9371 nmx_c.NMX_Controller/RemoveGpusFromPartition

Maintenance Flows for Compute Trays#

If a compute tray is exhibiting faulty behavior, indicating a hardware problem that needs physical attention, a defined maintenance flow needs to be followed in order to isolate the compute tray from a working NVLink partition.

  • There are different workflows for different NVLink partition types.

  • In the case of an NMC environment, we utilize UUID-based (gpuuid_based) partitions.

An overview of the maintenance workflow looks like this:

  1. Create an Out For Repair (OFR) partition with numgpus=0 (see the command sketch after this list).

  2. When a faulty compute tray needs to be isolated, remove the tray from the running partition.

  3. Add the tray to the OFR partition.

  4. (Optional) Power off the compute tray and remove it from the slot.

  5. (Optional) Insert the tray back into the same slot and power on the tray.

  6. Once maintenance is complete, proceed with the following tasks:
    • Power off the compute tray in the OFR partition.

    • Remove the OFR partition.

    • Add tray back to the running partition.
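
As an illustration of step 1, an empty OFR partition can be created with the same NVOS command shown earlier in this document; the partition ID, name, and mcast-limit of 0 below are only example values:

nv action create sdn partition 3 name ofr resiliency-mode adaptive_bandwidth mcast-limit 0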

This is a high-level overview. Please consult the Partition Guides available here for detailed information.

Workload Management and Job Scheduling Considerations#

When adapting NVLink partitions it is important to consider how this impacts the workload management systems and job scheduling. In this document we will cover some basics.

  • How it specifically impacts workload management systems and job scheduling will depend upon the specific environment and multi-tenancy stance.

  • NMC supports two workload management systems, Slurm and Run:ai, so this document focuses on those.

Please consult your NVIDIA contact for more information.

Background#

When providing NVIDIA Multi-Node NVLink capable systems (i.e., NVL72 GB200/GB300 rack-scale systems) as shared resources to users through Slurm and Run:ai, it is vital to understand NVLink Domains and Partitions.

  • The goal is to make the workload management system aware of the NVLink Partitions so jobs can be scheduled onto nodes within the cluster, across racks, appropriately.

  • The main point is scheduling jobs within Multi-Node NVLink capable blocks (i.e. NVLink Partitions). This is where the highest performance is guaranteed.

  • These optimized blocks of nodes could be the full rack, if the default NVLink partition is used, or certain nodes in a rack confined to their own NVLink partition.

This NVLink topology-aware scheduling should be mostly transparent to the users.

  • What we discuss here is mostly for administrators of the NVL72 system.

  • The specifics of how to do this for your environment will depend upon your NVLink Partition configuration and your Slurm/Run:ai configuration.

Slurm#

Within Slurm, the topology/block plugin was developed in order to help job scheduling on NVIDIA’s Multi-Node NVLink capable rack-scale systems. With this plugin we are able to build a topology out of blocks of nodes.

  • A block is a logical grouping of nodes that reside within the same NVLink Partition.

  • Blocks are non-overlapping: a node can only belong to one block.

  • The block size is the unit you define, which should align with NVLink partitions. When using the default partition (the full rack), the block size is 18 (all nodes in a rack).

The path to configuring these blocks within NMC is through Base Command Manager here:

[a03-u02-bcm-01->wlm[slurm]->topologysettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Revision
Topology plugin                  Block
Topology source                  Internal
Parameters                       <submode>
Tree settings                    <submode>
Block settings                   <submode>
Topograph settings               <submode>
[a03-u02-bcm-01->wlm[slurm]->topologysettings->blocksettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Block sizes
Revision
Block entity                     Rack
Allowed racks
Allowed nodegroups

Here is an example of the topology.conf file when the Block entity is “Rack”:

root@a03-u02-bcm-01:~# cat /cm/shared/apps/slurm/etc/slurm/topology.conf
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
BlockName=A02 Nodes=a02-p01-dgx-01-c[01-18]

In the above, we can see that we have a “BlockName” that is associated with a rack (i.e. “A02”).

  • The above configuration assumes that each rack is using the default partition, which is a full rack. It does not account for partitions within a rack. This is the default behavior.

To account for partitions within a rack (that is, when you delete the default partition and create new ones), switch the “Block entity” value to “NodeGroup”.

[a03-u02-bcm-01->wlm[slurm]->topologysettings->blocksettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Block sizes
Revision
Block entity                     NodeGroup
Allowed racks
Allowed nodegroups

We can then configure a new NodeGroup and specific nodes for that NodeGroup:

[a03-u02-bcm-01->nodegroup[ng1]]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Name                             ng1
Revision
Nodes                            a02-p01-dgx-01-c01..a02-p01-dgx-01-c09

And our topology.conf file will reflect those changes:

root@a03-u02-bcm-01:~# cat /cm/shared/apps/slurm/etc/slurm/topology.conf
# This section of this file was automatically generated by cmd. Do not edit manually!
# BEGIN AUTOGENERATED SECTION -- DO NOT REMOVE
BlockName=ng1 Nodes=a02-p01-dgx-01-c[01-09]
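
For the two-partition example created earlier in this document (trays c01-c09 in NVLink partition 1 and trays c10-c18 in partition 2), adding a second NodeGroup for the remaining trays would yield one block per NVLink partition. The NodeGroup name ng2 below is only an illustration:

BlockName=ng1 Nodes=a02-p01-dgx-01-c[01-09]
BlockName=ng2 Nodes=a02-p01-dgx-01-c[10-18]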

Note

  • Designing proper topology awareness can require expert knowledge about the system. Additionally, it requires clear understanding of the multi-tenancy requirements for the target environment.

  • We did not discuss Slurm Partitions in this context. That is another layer of configuration complexity that should be considered when designing around NVLink Partitions.

There is no one-size-fits-all configuration. Please consult your NVIDIA contact for more information and guidance.

Run:ai#

  • Run:ai automatically detects NVLink Partitions (or Multi-Node NVLink Domains) for us. Some details are below for those interested.

  • Run:ai taps into node labels provided by the NVIDIA GPU Operator; these labels are named “nvidia.com/gpu.clique”.

Recall that we can see the clique in BCM as well:

[a03-u02-bcm-01->device]% nvdomaininfo -r a02
Node                GPU      Status   Cluster UUID                           Clique ID
------------------- -------- -------- -------------------------------------- ------------
a02-p01-dgx-01-c01  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c01  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   1
a02-p01-dgx-01-c02  0        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  1        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  2        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
a02-p01-dgx-01-c02  3        Success  a868ae12-3336-47d9-82cc-85febb8ed722   32766
  • The above shows that nodes a02-p01-dgx-01-c01 and a02-p01-dgx-01-c02 are in different NVLink Partitions.

  • Run:ai discovers this via the “nvidia.com/gpu.clique” label that is applied to the GB200/GB300 compute trays that are allocated as Run:ai GPU worker nodes.

Another way to see the node label is through the “kubectl” command:

root@a03-u02-bcm-01:~# kubectl describe nodes | grep clique
nvidia.com/gpu.clique=a868ae12-3336-47d9-82cc-85febb8ed722.32766
nvidia.com/gpu.clique=a868ae12-3336-47d9-82cc-85febb8ed722.1
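
The clique label can also be listed as a column next to each node name, which makes it easier to see which trays share an NVLink partition:

kubectl get nodes -L nvidia.com/gpu.clique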

Note

What is described above for Run:ai's ability to autodetect Multi-Node NVLink Domains is very basic and showcases the default behavior.

  • Typically the default behavior is sufficient and customizations are not needed, but please consult with your NVIDIA contacts.
