For general TensorRT-LLM features and configuration, see the Reference Guide.
Note: The scripts referenced in this example (such as
srun_aggregated.shandsrun_disaggregated.sh) can be found inexamples/basics/multinode/trtllm/.
To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
the set of nodes need to be launched together in the same MPI world, such as
via mpirun or srun. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.
In this document we will demonstrate two examples launching multinode workers
on a slurm cluster with srun:
NOTE: Some of the scripts used in this example like start_frontend_services.sh and
start_trtllm_worker.sh should be translatable to other environments like Kubernetes, or
using mpirun directly, with relative ease.
For simplicity of the example, we will make some assumptions about your slurm cluster:
First, we assume you have access to a slurm cluster with multiple GPU nodes available. For functional testing, most setups should be fine. For performance testing, you should aim to allocate groups of nodes that are performantly inter-connected, such as those in an NVL72 setup.
Second, we assume this slurm cluster has the Pyxis
SPANK plugin setup. In particular, the srun_aggregated.sh script in this
example will use srun arguments like --container-image,
--container-mounts, and --container-env that are added to srun by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead.
Third, we assume you have a Dynamo+TRTLLM container image available.
You can use the prebuilt container or build a custom one.
This is the image that can be set to the IMAGE environment variable in later steps.
Fourth, we assume you pre-allocate a group of nodes using salloc. We
will allocate 8 nodes below as a reference command to have enough capacity
to run both examples. If you plan to only run the aggregated example, you
will only need 4 nodes. If you customize the configurations to require a
different number of nodes, you can adjust the number of allocated nodes
accordingly. Pre-allocating nodes is technically not a requirement,
but it makes iterations of testing/experimenting easier.
Make sure to set your PARTITION and ACCOUNT according to your slurm cluster setup:
Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which may be the default behavior after executing the salloc command above
depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
This example aims to automate as much of the environment setup as possible, but all slurm clusters and environments are different, and you may need to dive into the scripts to make modifications based on your specific environment.
Assuming you have already allocated your nodes via salloc, and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based:
Assuming you have at least 4 nodes allocated following the setup steps above, follow these steps below to launch an aggregated deployment across 4 nodes:
Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode) following the setup above, follow these steps below to launch a disaggregated deployment across 8 nodes:
Make sure you have a fresh environment and don’t still have the aggregated example above still deployed on the same set of nodes.
To launch multiple replicas of the configured prefill/decode workers, you can set NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
srun_aggregated.sh launches two srun jobs. The first launches
etcd, NATS, and the OpenAI frontend on the head node only
called “node1” in the example output below. The second launches
a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
using 4 GPUs each.
srun_disaggregated.sh, it follows a very similar flow, but instead launches
three srun jobs instead of two. One for frontend, one for prefill worker,
and one for decode worker.To verify the deployed model is working, send a curl request:
To cleanup background srun processes launched by srun_aggregated.sh or
srun_disaggregated.sh, you can run:
/dev/shm/moe_*. For
now, you must manually clean these up before deploying again on the
same set of nodes.srun
jobs. After cleaning up any leftover shared memory files as described
above, the GPU memory may slowly come back. You can run watch nvidia-smi
to check on this behavior. If you don’t free the GPU memory before the
next deployment, you may get a CUDA OOM error while loading the model.