Deployment Models#
In an NVLink multi-node cluster, the IMEX service must be started and initialized before launching CUDA applications. Before starting the service, an administrator or job launcher must populate the nodes_config.cfg file, which is specified by the IMEX_NODE_CONFIG_FILE option, with the IP addresses of all participating nodes. This chapter provides information about managing IP addresses, starting and stopping the service, and other related tasks that might vary depending on the deployment model.
Constraints#
The IMEX service transitions to the ready-for-processing state after achieving the configured level of quorum (refer to the Quorum section) with the IMEX instances whose IP addresses are configured in the IMEX_NODE_CONFIG_FILE (nodes_config.cfg). After establishing gRPC connections with the other IMEX instances running on the specified nodes, the service verifies that its IP address map matches every other instance's IP address map. Any IP address mapping discrepancy causes IMEX service initialization to fail.
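Once the instances are up, an administrator can confirm that the domain has reached the ready state by querying each instance's view of its peers; a minimal sketch, assuming the command service is enabled (IMEX_CMD_ENABLED=1 and IMEX_CMD_PORT set, as done in the prolog script later in this chapter):

# Query IMEX status for every node listed in nodes_config.cfg
nvidia-imex-ctl -a -q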
Deployment Models#
This section provides information about deployment models.
NVLink Domain Wide#
The simplest form of deployment involves starting IMEX instances across the NVLink domain. In this option, each node’s IMEX service is configured with the list of IP addresses of all compute nodes in the domain. For instance, on an NVIDIA GH200 NVL32 with 16 compute node instances, the IMEX nodes_config.cfg file contains 16 IP addresses.
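For illustration, a domain-wide nodes_config.cfg is one compute-node IP address per line, so on the 16-node system above it contains 16 lines. The addresses below are placeholders and the listing is abridged:

10.0.0.1
10.0.0.2
10.0.0.3
...
10.0.0.16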
NVLink Partition Wide#
Another option is to start IMEX instances at NVLink partition boundaries. Here, the IMEX nodes_config.cfg file on each node includes the IP addresses of the compute nodes in that node's NVLink partition. In this option, the IMEX domain matches the partition, so there is less chance of confusing errors in CUDA. However, this option requires the IMEX service to be reconfigured and restarted whenever the fabric partition configuration changes.
Per-Job Wide#
You can also deploy IMEX instances on a per-job basis, where the IMEX nodes_config.cfg file includes the IP addresses of the compute nodes that are part of the job. This option ensures that failures are isolated to the specific job, which minimizes the impact on other jobs or the cluster. However, it requires managing the lifecycle of IMEX instances for each job, which includes preparing the nodes_config.cfg file, starting the IMEX services before the job begins, and tearing down the IMEX services after the job completes.
Job Scheduler Integration#
This section provides information about integrating IMEX with a job scheduler.
SLURM Scheduler Integration#
To integrate with the SLURM job scheduler, the following example prolog and epilog scripts bring up IMEX when a job starts and clean it up when the job ends.
The prolog script first stops any existing IMEX service and then configures a new instance that is specific to the submitted job.
Here are the elements that are configured as part of the job:
nodes_config.cfg: Populated with the IP addresses of the compute nodes participating in the job.
SERVER_PORT: Set to a job-specific port, identical on all members of the job, to reduce the chance of cross-communication between jobs.
IMEX_CONN_WAIT_TIMEOUT: Sets a timeout value so the prolog can terminate if an error in the cluster prevents IMEX from fully coming up.
IMEX_CMD_PORT: Sets the command port so that the nvidia-imex-ctl application can connect to and query the IMEX service.
IMEX_CMD_ENABLED: Enables the command service.
Here is the prolog script:
#!/usr/bin/env bash
if ! systemctl list-units --full --all | grep -Fq "nvidia-imex.service"; then
    exit 0
fi

{
    set -ex

    # Clean the config file in case the service gets started by accident
    > /etc/nvidia-imex/nodes_config.cfg

    NVIDIA_IMEX_START_TIMEOUT=60
    IMEX_CONN_WAIT_TIMEOUT=70
    NVIDIA_IMEX_STOP_TIMEOUT=15

    # clean up prev connection
    set +e
    timeout $NVIDIA_IMEX_STOP_TIMEOUT systemctl stop nvidia-imex
    pkill -9 nvidia-imex
    set -e

    # update peer list
    scontrol -a show node "${SLURM_NODELIST}" -o | sed 's/^.* NodeAddr=\([^ ]*\).*/\1/' > /etc/nvidia-imex/nodes_config.cfg

    # rotate server port to prevent race condition
    NEW_SERVER_PORT=$((${SLURM_JOB_ID} % 16384 + 33792))
    sed -i "s/SERVER_PORT.*/SERVER_PORT=${NEW_SERVER_PORT}/" /etc/nvidia-imex/config.cfg

    # enable imex-ctl on all nodes so you can query imex status with: nvidia-imex-ctl -a -q
    sed -i "s/IMEX_CMD_PORT.*/IMEX_CMD_PORT=50005/" /etc/nvidia-imex/config.cfg
    sed -i "s/IMEX_CMD_ENABLED.*/IMEX_CMD_ENABLED=1/" /etc/nvidia-imex/config.cfg

    # set timeouts for start
    sed -i "s/IMEX_CONN_WAIT_TIMEOUT.*/IMEX_CONN_WAIT_TIMEOUT=${IMEX_CONN_WAIT_TIMEOUT}/" /etc/nvidia-imex/config.cfg

    timeout $NVIDIA_IMEX_START_TIMEOUT systemctl start nvidia-imex
} > "/var/log/slurm/imex_prolog_${SLURM_JOB_ID}.log" 2>&1
The epilog script terminates the IMEX service.
#!/usr/bin/env bash
set -ex
if ! systemctl list-units --full --all | grep -Fq "nvidia-imex.service"; then
    exit 0
fi
# Clean the config file in case the service gets started by accident
> /etc/nvidia-imex/nodes_config.cfg
NVIDIA_IMEX_STOP_TIMEOUT=15
# clean up connection
set +e
timeout $NVIDIA_IMEX_STOP_TIMEOUT systemctl stop nvidia-imex
pkill -9 nvidia-imex
set -e
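To have SLURM run these scripts on each compute node at job start and end, they are typically referenced from slurm.conf. A minimal sketch, with illustrative script paths:

# Excerpt from slurm.conf (paths are illustrative)
Prolog=/etc/slurm/imex_prolog.sh
Epilog=/etc/slurm/imex_epilog.sh
# Run the prolog at job allocation time rather than at first task launch
PrologFlags=Alloc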