Slurm Workload Management#
NVIDIA Mission Control utilizes Slurm, an open-source workload manager developed by SchedMD, to orchestrate and schedule jobs across the DGX SuperPOD cluster. This guide assumes that administrators are already familiar with basic Slurm operations and configuration.
Detailed instructions on managing Slurm queues, submitting and monitoring jobs, managing node states (drain, resume), and configuring job prolog/epilog scripts can be found in the following resources:
Base Command Manager (BCM) Administrator Guide:
Managing Slurm and Workload Integration (Section 7)
Slurm Prolog and Epilog Scripts (Section 7.3.4)
SchedMD Slurm Documentation:
Slurm and IMEX#
New to NVIDIA GB200 systems, the NVIDIA Internode Memory Exchange Service (IMEX) is a secure service that facilitates the mapping of GPU memory over NVLink between the GPUs in an NVLink domain. BCM can enable the IMEX daemon either globally or per job for Slurm.
It is recommended to run the IMEX daemon per job for Slurm. Running the IMEX daemon per job has the advantage that one user's job cannot read the memory of a job run by another user on another node.
Configuring IMEX per job means that the daemon starts just before the job starts, and that the service runs only on the nodes that need it.
To run IMEX per job, any global IMEX setting must first be cleared away. This can be done by:
Removing the nvidia-imex service entirely from CMDaemon.
Stopping the service on the compute nodes, for example with pdsh or pexec (a sketch follows the cmsh session below), or simply carrying out a reboot.
The value of imex in the slurmclient role can then be set:
[root@basecm11 ~]# cmsh
[basecm11]% configurationoverlay roles slurm-client-gpu
[basecm11->configurationoverlay[slurm-client-gpu]->roles]% use slurmclient
[basecm11->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]% set imex yes
[basecm11->configurationoverlay*[slurm-client-gpu*]->roles*[slurmclient*]]% commit
[basecm11->configurationoverlay[slurm-client-gpu]->roles[slurmclient]]%
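For the second step listed above (stopping the service on the compute nodes), a minimal sketch using pdsh, assuming the compute nodes are reachable under the placeholder hostnames dgx[01-18]:
pdsh -w dgx[01-18] 'systemctl stop nvidia-imex.service'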
Committing the imex setting in the cmsh session above configures Slurm prolog and epilog scripts in the backend as follows:
The prolog script configures the IMEX daemon with the nodes allocated to the GPU job and starts the IMEX daemon on the node.
The epilog script cleans up the configuration and stops the IMEX daemon on the node.
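To verify that prolog and epilog hooks are configured, the Slurm configuration can be queried from any node; this is a generic check that shows the slurm.conf settings only, since the script contents are managed by BCM in the backend:
scontrol show config | grep -iE 'prolog|epilog'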
Slurm Topology Block Support#
Slurm can be configured to support topology-aware resource allocation to optimize job performance. More information about topologies in Slurm is available at the Slurm Topology Guide.
Previous versions of BCM supported only the topology/tree option for Slurm configuration. BCM 11 introduces support for topology/block, which is required for BCM to enable NVLink-aware scheduling of Slurm jobs on GB200 systems.
The topology.conf file describes the cluster's network topology for optimized job resource allocation. BCM 11 can generate different types of topology.conf to customize the behavior of the Slurm topology plugins. More information about topology.conf is available at the Slurm Topology Configuration documentation.
On the BCM head node, topology.conf is located at /cm/shared/apps/slurm/etc/<CLUSTER_NAME>/topology.conf.
To support topology/block, CMDaemon introduces new entities to customize Slurm in cmsh and Base View: SlurmTreeTopologySettings and SlurmBlockTopologySettings are both valid objects for the parent SlurmTopologySettings field.
The new commands to configure the SlurmBlockTopologySettings are available in cmsh and Base View in the wlm mode/section:
[basecm11->wlm[slurm]]% topologysettings
[basecm11->wlm[slurm]->topologysettings]% show
Parameter Value
-------------------------------- ---------------------------------------------
Topology plugin None
Topology source None
Parameters <submode>
Tree settings <submode>
Block settings <submode>
Topograph settings <submode>
From the topologysettings menu, the following parameters and submodes are available:
Topology plugin: parameter that defines which Slurm topology plugin is used (tree or block). By default none is set and topology.conf is not generated.
Topology source: parameter that defines where BCM takes the information used to construct the topology. Values: internal (taken from switch/rack information) or topograph (taken from the topograph service).
Parameters: submode for tuning Slurm parameters that apply to all topology plugins.
Tree settings: submode for configuring the settings of the tree topology.
Block settings: submode for configuring the settings of the block topology.
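A minimal cmsh sketch of selecting the block plugin, assuming the parameter names follow the usual cmsh convention of the labels above with the spaces removed (verify the exact names with show before committing):
[basecm11->wlm[slurm]->topologysettings]% set topologyplugin block
[basecm11->wlm[slurm]->topologysettings*]% set topologysource internal
[basecm11->wlm[slurm]->topologysettings*]% commit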
If block is chosen as the topology plugin, additional options (e.g., block size, rack type) are available to the administrator in the block settings submode. The administrator can manually define Slurm blocks by leveraging information about the cluster layout, such as full GB200 racks (1x72 configuration), half GB200 racks (2x36 configuration), or generic nodegroups.
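For illustration, a topology.conf using the block plugin for a two-rack layout like the one used later in this guide might look like the following sketch; the block names are placeholders and the file BCM generates may differ in detail:
# /cm/shared/apps/slurm/etc/<CLUSTER_NAME>/topology.conf (sketch)
# One block per NVL72 rack (18 compute nodes, 72 GPUs)
BlockName=rack_a07 Nodes=a07-p1-dgx-03-c[01-18]
BlockName=rack_b05 Nodes=b05-p1-dgx-05-c[01-18]
# Jobs may be placed within one rack (18 nodes) or across both (36 nodes)
BlockSizes=18,36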
Slurm Partitioning with NVLink#
A Slurm partition is a distinct job queue that groups compute nodes together and allows specific resource constraints and limits to be set for those nodes.
The following tables show examples of different partitions (or job queues):
Slurm partition configuration - Basic properties

| Partition | Description | Priority | Default Runtime | Max Runtime | Preemptable |
|---|---|---|---|---|---|
| batch (default) | Default partition, suitable for most jobs | Medium | 30 minutes | 4 hours | No |
| batch_short | Higher priority with 2-hour limit for quick jobs | High | 30 minutes | 2 hours | No |
| batch_long | Lower priority for longer-running jobs | Low | 30 minutes | 8 hours | No |
| batch_large | For very large jobs (>50% of cluster) | High+++ | 30 minutes | 4 hours | No |
| interactive | High priority for development and testing | High+ | 30 minutes | 4 hours | No |
| backfill | Low priority, preemptable, minimal restrictions | Low | 30 minutes | 7 days | Yes |
| admin | Administrative debugging partition | High++ | 30 minutes | 8 hours | No |
Slurm partition configuration - Resource limits and access

| Partition | Node Types / Count | Resource Limits / QoS | Allowed Accounts |
|---|---|---|---|
| batch (default) | All nodes | Set max nodes per job | All |
| batch_short | All nodes | Max 4 nodes per user, max 20 nodes total | All |
| batch_long | All nodes | Max 10% of partition nodes per user | Approval required |
| batch_large | All nodes | Prioritizes large jobs on the system | Approval required |
| interactive | All nodes + 6 dedicated | Max 2 nodes per job, max 2 nodes per user | All |
| backfill | All nodes | Preempted jobs requeue automatically | All |
| admin | All nodes + dedicated interactive | Admin only access | admin |
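Users direct jobs to one of these queues at submission time with the --partition option; for example (job.sh is a placeholder batch script):
sbatch --partition=batch_long --nodes=4 --time=06:00:00 job.sh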
Within BCM, you can create and manage Slurm partitions through cmsh:
[root@basecm11 ~]# cmsh
[basecm11]% wlm
[basecm11->wlm[slurm]]% jobqueue
[basecm11->wlm[slurm]->jobqueue]% help
...
=============================== jobqueue ===============================
add ........................... Create and use a jobqueue
append ........................ Append value(s) to jobqueue property
clear ......................... Clear specific jobqueue property
clone ......................... Clone and use a jobqueue
commit ........................ Commit local changes
foreach ....................... Execute a set of commands on several jobqueues
format ........................ Modify or view current list format
get ........................... Get specific jobqueue property
jobqueue ...................... Enter job queues mode
list .......................... List overview
range ......................... Set a range of several jobqueues to execute future commands on
refresh ....................... Revert local changes
remove ........................ Remove a jobqueue
removefrom .................... Remove value(s) from jobqueue property
set ........................... Set jobqueue properties
show .......................... Show jobqueue properties
sort .......................... Modify or view current list sort order
statistics .................... Get job queue statistics.
swap .......................... Swap uuid names of two jobqueue
undefine ...................... Undefine specific jobqueue property
use ........................... Use the specified jobqueue
usedby ........................ List all entities which depend on this jobqueue
validate ...................... Remote validate a jobqueue
[basecm11->wlm[slurm]->jobqueue]% add batch_long
[basecm11->wlm[slurm]->jobqueue[batch_long]]% show
Parameter Value
-------------------------------- ---------------------------------------------
Name batch_long
Revision
Type Slurm
WlmCluster slurm
Ordering 0
Default no
Hidden no
Min nodes 1
Max nodes
Default time UNLIMITED
Max time 12:00:00
Priority Job Factor 1
Priority Tier 1
OverSubscribe exclusive
Alternate
Grace time 0
Preemption mode OFF
Require reservation NO
Select Type Parameters
LLN no
TRES Billing Weights
Alloc nodes
CPU bindings None
QOS
Default memory per CPU UNLIMITED
Max memory per CPU UNLIMITED
Default memory per Node UNLIMITED
Max memory per Node UNLIMITED
Max CPUs per node UNLIMITED
Default memory per GPU UNLIMITED
Default CPU per GPU UNLIMITED
Disable root no
Root only no
Allow groups
Allow accounts
Allow QOS ALL
Deny Accounts
Deny QOS
ExclusiveUser no
Queue State None
Nodesets
Overlays
Categories
Nodegroups
Compute nodes
All nodes
Options
At this point, you can configure the batch_long partition with various resource constraints and limits to fit the needs of the use case.
Example that sets the compute nodes for batch_long:
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set computenodes <computenode names>
Example that adds a user (account) and a group to a jobqueue (you can also append users/groups or supply a comma-separated list):
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set allowaccounts <user_id>
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set allowgroups <group_id>
When designing partitions, it is important to consider the underlying NVLink network. For instance, each NVL72 rack is by default an NVLink partition that includes 72 GPUs connected through NVLink. Ideally, compute nodes in a Slurm partition also share the same NVLink partition.
The exception is when you need to combine two NVL72 racks (or NVLink Domains) into a single Slurm partition. This is necessary for large training jobs that require more resources than a single NVL72 rack provides.
In this example scenario, where two NVL72 racks exist, a combination of Slurm partitions across both racks could be configured with the following logic:
Flex-R1: flexible partition for small-scale testing that allows a maximum of two nodes for a maximum of two hours on Rack 1.
Flex-R2: flexible partition for small-scale testing that allows a maximum of two nodes for a maximum of two hours on Rack 2.
Benchmark-R1: allows full-rack (18-node) jobs with a maximum time of twelve hours on Rack 1.
Benchmark-R2: allows full-rack (18-node) jobs with a maximum time of twelve hours on Rack 2.
Benchmark-Combined: allows 36-node jobs across both racks with a maximum time of twelve hours.
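A cmsh sketch of the last queue in this scenario, reusing the rack node names that appear elsewhere in this guide and assuming the property names shown in the show output above (maxtime, maxnodes, computenodes):
[basecm11->wlm[slurm]->jobqueue]% add benchmark-combined
[basecm11->wlm[slurm]->jobqueue[benchmark-combined]]% set maxtime 12:00:00
[basecm11->wlm[slurm]->jobqueue[benchmark-combined]]% set maxnodes 36
[basecm11->wlm[slurm]->jobqueue[benchmark-combined]]% set computenodes a07-p1-dgx-03-c[01-18],b05-p1-dgx-05-c[01-18]
[basecm11->wlm[slurm]->jobqueue[benchmark-combined]]% commit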
Common Slurm Management Tasks#
Creating a Slurm partition#
Within BCM, you can create and manage Slurm partitions through cmsh:
[root@basecm11 ~]# cmsh
[basecm11]% wlm
[basecm11->wlm[slurm]]% jobqueue
[basecm11->wlm[slurm]->jobqueue]% help
...
=============================== jobqueue ===============================
add ........................... Create and use a jobqueue
append ........................ Append value(s) to jobqueue property
clear ......................... Clear specific jobqueue property
clone ......................... Clone and use a jobqueue
commit ........................ Commit local changes
foreach ....................... Execute a set of commands on several jobqueues
format ........................ Modify or view current list format
get ........................... Get specific jobqueue property
jobqueue ...................... Enter job queues mode
list .......................... List overview
range ......................... Set a range of several jobqueues to execute future commands on
refresh ....................... Revert local changes
remove ........................ Remove a jobqueue
removefrom .................... Remove value(s) from jobqueue property
set ........................... Set jobqueue properties
show .......................... Show jobqueue properties
sort .......................... Modify or view current list sort order
statistics .................... Get job queue statistics.
swap .......................... Swap uuid names of two jobqueue
undefine ...................... Undefine specific jobqueue property
use ........................... Use the specified jobqueue
usedby ........................ List all entities which depend on this jobqueue
validate ...................... Remote validate a jobqueue
[basecm11->wlm[slurm]->jobqueue]% add batch_long
[basecm11->wlm[slurm]->jobqueue[batch_long]]% show
Parameter Value
-------------------------------- ---------------------------------------------
Name batch_long
Revision
Type Slurm
WlmCluster slurm
Ordering 0
Default no
Hidden no
Min nodes 1
Max nodes
Default time UNLIMITED
Max time 12:00:00
Priority Job Factor 1
Priority Tier 1
OverSubscribe exclusive
Alternate
Grace time 0
Preemption mode OFF
Require reservation NO
Select Type Parameters
LLN no
TRES Billing Weights
Alloc nodes
CPU bindings None
QOS
Default memory per CPU UNLIMITED
Max memory per CPU UNLIMITED
Default memory per Node UNLIMITED
Max memory per Node UNLIMITED
Max CPUs per node UNLIMITED
Default memory per GPU UNLIMITED
Default CPU per GPU UNLIMITED
Disable root no
Root only no
Allow groups
Allow accounts
Allow QOS ALL
Deny Accounts
Deny QOS
ExclusiveUser no
Queue State None
Nodesets
Overlays
Categories
Nodegroups
Compute nodes
All nodes
Options
Any of the properties above can be set through cmsh -> wlm. The following sets the compute nodes b08-p1-dgx-08-c[01-18] to be allocated from the batch_long partition.
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set computenodes b08-p1-dgx-08-c[01-18]
Draining nodes can be done with the standard Slurm CLI tool scontrol:
scontrol update nodename=a06-p1-dgx-02-c[01-04,06-11,13-14,16-18] state=drain reason="maintenance"
Resuming nodes can be done similarly:
scontrol update nodename=a06-p1-dgx-02-c[01-04,06-11,13-14,16-18] state=resume
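Nodes that are currently drained or down, together with the recorded reason, can be listed with:
sinfo -R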
Adding or Appending to a jobqueue#
Adding a user (account) or group to a jobqueue (partition):
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set allowaccounts <user_id>
[basecm11->wlm[slurm]->jobqueue[batch_long]]% set allowgroups <group_id>
Appending a user (account) or group to a jobqueue (partition):
[basecm11->wlm[slurm]->jobqueue[batch_long]]% append allowaccounts <user_id>
[basecm11->wlm[slurm]->jobqueue[batch_long]]% append allowgroups <group_id>
Example: Running Interactive Slurm Job to confirm IMEX setup#
NOTE: This example assumes IMEX is configured to run per-job and that you have access to the slurm-login node.
Check Slurm partitions and node availability with sinfo.
root@a03-p1-aps-arm-01:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
bcm_qa_test up infinite 18 alloc b08-p1-dgx-08-c[01-18]
benchmark-a07 up 12:00:00 2 drain a07-p1-dgx-03-c[06,12]
benchmark-a07 up 12:00:00 15 idle a07-p1-dgx-03-c[01-05,07,09-11,13-18]
benchmark-b05 up 12:00:00 18 idle b05-p1-dgx-05-c[01-18]
benchmark-combined up 12:00:00 2 drain a07-p1-dgx-03-c[06,12]
benchmark-combined up 12:00:00 33 idle a07-p1-dgx-03-c[01-05,07,09-11,13-18],b05-p1-dgx-05-c[01-18]
computex_demo up infinite 1 down* b06-p1-dgx-06-c01
computex_demo up infinite 16 alloc b06-p1-dgx-06-c[10-17],b07-p1-dgx-07-c[01-08]
computex_demo up infinite 13 idle b06-p1-dgx-06-c[02-03,18],b07-p1-dgx-07-c[09-18]
defq* up infinite 4 drain* a05-p1-dgx-01-c[01,07,16],b06-p1-dgx-06-c04
defq* up infinite 4 down* a05-p1-dgx-01-c10,b06-p1-dgx-06-c[01,06-07]
defq* up infinite 3 drain a07-p1-dgx-03-c[06,12],b06-p1-dgx-06-c05
defq* up infinite 48 alloc a05-p1-dgx-01-c[02-06,08-09,11-15,17-18],b06-p1-dgx-06-c[10-17],b07-p1-dgx-07-c[01-08],b08-p1-dgx-08-c[01-18]
defq* up infinite 48 idle a07-p1-dgx-03-c[01-05,07,09-11,13-18],b05-p1-dgx-05-c[01-18],b06-p1-dgx-06-c[02-03,08-09,18],b07-p1-dgx-07-c[09-18]
flex-a07 up 2:00:00 2 drain a07-p1-dgx-03-c[06,12]
flex-a07 up 2:00:00 15 idle a07-p1-dgx-03-c[01-05,07,09-11,13-18]
flex-b05 up 2:00:00 18 idle b05-p1-dgx-05-c[01-18]
Select idle nodes to test on and create a Slurm allocation (in this example a full NVL72 rack is selected).
root@a03-p1-aps-arm-01:~# salloc -N 18 -w b05-p1-dgx-05-c[01-18]
salloc: Granted job allocation 3912
salloc: Waiting for resource configuration
salloc: Nodes b05-p1-dgx-05-c[01-18] are ready for job
SSH to a node in the allocation and run nvidia-imex-ctl -N.
root@b05-p1-dgx-05-c01:~# nvidia-imex-ctl -N
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
C - Connected - Ready for operation
5/13/2025 14:46:11.588
Nodes:
Node #0 - 7.241.18.139 - READY - Version: 570.124.06
Node #1 - 7.241.18.140 - READY - Version: 570.124.06
Node #2 - 7.241.18.141 - READY - Version: 570.124.06
Node #3 - 7.241.18.142 - READY - Version: 570.124.06
Node #4 - 7.241.18.143 - READY - Version: 570.124.06
Node #5 - 7.241.18.144 - READY - Version: 570.124.06
Node #6 - 7.241.18.145 - READY - Version: 570.124.06
Node #7 - 7.241.18.146 - READY - Version: 570.124.06
Node #8 - 7.241.18.147 - READY - Version: 570.124.06
Node #9 - 7.241.18.148 - READY - Version: 570.124.06
Node #10 - 7.241.18.149 - READY - Version: 570.124.06
Node #11 - 7.241.18.150 - READY - Version: 570.124.06
Node #12 - 7.241.18.151 - READY - Version: 570.124.06
Node #13 - 7.241.18.152 - READY - Version: 570.124.06
Node #14 - 7.241.18.153 - READY - Version: 570.124.06
Node #15 - 7.241.18.154 - READY - Version: 570.124.06
Node #16 - 7.241.18.155 - READY - Version: 570.124.06
Node #17 - 7.241.18.156 - READY - Version: 570.124.06
Nodes From\To 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
0 C C C C C C C C C C C C C C C C C C
1 C C C C C C C C C C C C C C C C C C
2 C C C C C C C C C C C C C C C C C C
3 C C C C C C C C C C C C C C C C C C
4 C C C C C C C C C C C C C C C C C C
5 C C C C C C C C C C C C C C C C C C
6 C C C C C C C C C C C C C C C C C C
7 C C C C C C C C C C C C C C C C C C
8 C C C C C C C C C C C C C C C C C C
9 C C C C C C C C C C C C C C C C C C
10 C C C C C C C C C C C C C C C C C C
11 C C C C C C C C C C C C C C C C C C
12 C C C C C C C C C C C C C C C C C C
13 C C C C C C C C C C C C C C C C C C
14 C C C C C C C C C C C C C C C C C C
15 C C C C C C C C C C C C C C C C C C
16 C C C C C C C C C C C C C C C C C C
17 C C C C C C C C C C C C C C C C C C
Domain State: UP
The nvidia-imex-ctl -N command prints the full status of the current IMEX Domain. It is a useful way to confirm the health of an NVL72 system.
As stated, BCM provides a way to create IMEX domains per job in Slurm. The IMEX service and its configuration file can be inspected as well.
Checking nvidia-imex.service on a DGX node.
root@b05-p1-dgx-05-c01:~# systemctl status nvidia-imex.service
● nvidia-imex.service - NVIDIA IMEX service
Loaded: loaded (/usr/lib/systemd/system/nvidia-imex.service; enabled; preset: enabled)
Active: active (running) since Tue 2025-05-13 14:44:29 PDT; 5min ago
Process: 99921 ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg (code=exited, status=0/SUCCESS)
Main PID: 99981 (nvidia-imex)
Tasks: 32 (limit: 293736)
Memory: 10.3M ()
CGroup: /system.slice/nvidia-imex.service
└─99981 /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
May 13 14:44:29 b05-p1-dgx-05-c01 systemd[1]: Starting nvidia-imex.service - NVIDIA IMEX service...
May 13 14:44:29 b05-p1-dgx-05-c01 systemd[1]: Started nvidia-imex.service - NVIDIA IMEX service.
Check /etc/nvidia-imex/nodes_config.cfg (this list is populated with the same nodes that are in $SLURM_JOB_NODELIST).
root@b05-p1-dgx-05-c01:~# cat /etc/nvidia-imex/nodes_config.cfg
# imex nodes_config for job 3912
# b05-p1-dgx-05-c01
7.241.18.139
# b05-p1-dgx-05-c02
7.241.18.140
# b05-p1-dgx-05-c03
7.241.18.141
# b05-p1-dgx-05-c04
7.241.18.142
# b05-p1-dgx-05-c05
7.241.18.143
# b05-p1-dgx-05-c06
7.241.18.144
# b05-p1-dgx-05-c07
7.241.18.145
# b05-p1-dgx-05-c08
7.241.18.146
# b05-p1-dgx-05-c09
7.241.18.147
# b05-p1-dgx-05-c10
7.241.18.148
# b05-p1-dgx-05-c11
7.241.18.149
# b05-p1-dgx-05-c12
7.241.18.150
# b05-p1-dgx-05-c13
7.241.18.151
# b05-p1-dgx-05-c14
7.241.18.152
# b05-p1-dgx-05-c15
7.241.18.153
# b05-p1-dgx-05-c16
7.241.18.154
# b05-p1-dgx-05-c17
7.241.18.155
# b05-p1-dgx-05-c18
7.241.18.156
Checking /var/log/nvidia-imex.log on a DGX node.
IMEX Log initializing at: 5/13/2025 14:44:29.455
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX version 570.124.06 is running with the following configuration options
[May 13 2025 14:44:29] [INFO] [tid 99981] Logging level = 4
[May 13 2025 14:44:29] [INFO] [tid 99981] Logging file name/path = /var/log/nvidia-imex.log
[May 13 2025 14:44:29] [INFO] [tid 99981] Append to log file = 1
[May 13 2025 14:44:29] [INFO] [tid 99981] Max Log file size = 1024 (MBs)
[May 13 2025 14:44:29] [INFO] [tid 99981] Use Syslog file = 0
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX Library communication bind interface =
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX library communication bind port = 29262
[May 13 2025 14:44:29] [INFO] [tid 99981] Identified this node as ID 0, using bind IP of '7.241.18.139', and network interface of bond0
[May 13 2025 14:44:29] [INFO] [tid 99981] NvGpu Library version matched with GPU Driver version
[May 13 2025 14:44:29] [INFO] [tid 100014] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 100015] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 100016] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 99981] Creating gRPC channels to all peers (nPeers = 18).
[May 13 2025 14:44:29] [INFO] [tid 100017] Started processing of incoming messages.
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 0 with ip address 7.241.18.139. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 3 with ip address 7.241.18.142. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 2 with ip address 7.241.18.141. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 4 with ip address 7.241.18.143. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 1 with ip address 7.241.18.140. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 5 with ip address 7.241.18.144. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 7 with ip address 7.241.18.146. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 6 with ip address 7.241.18.145. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 8 with ip address 7.241.18.147. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 99981] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 9 with ip address 7.241.18.148. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 10 with ip address 7.241.18.149. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 11 with ip address 7.241.18.150. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 12 with ip address 7.241.18.151. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 13 with ip address 7.241.18.152. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 14 with ip address 7.241.18.153. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 15 with ip address 7.241.18.154. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 16 with ip address 7.241.18.155. Number of times connected: 1
[May 13 2025 14:44:29] [INFO] [tid 99981] GPU event successfully subscribed
[May 13 2025 14:44:29] [INFO] [tid 100018] Connection established to node 17 with ip address 7.241.18.156. Number of times connected: 1