Multi-Device Inference#

Multi-device inference scales TensorRT inference across multiple GPUs. Use it when a model exceeds single-GPU memory or when parallel execution reduces latency for memory-bound workloads.

To support these multi-device workflows, TensorRT relies on NVIDIA NCCL. For compatibility across environments and deployments, TensorRT discovers the NCCL library automatically: it checks for libnccl.so.2 first and then falls back to libnccl.so via LD_LIBRARY_PATH.
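The lookup order described above can be approximated outside TensorRT to check whether NCCL is discoverable in a given environment. The following is a minimal sketch, not TensorRT's actual implementation: the helper name `find_nccl` and its `loader` parameter are hypothetical, and the only behavior taken from the text is the order of the two library names.

```python
import ctypes

# Lookup order described above: versioned soname first, then the unversioned name.
NCCL_CANDIDATES = ("libnccl.so.2", "libnccl.so")


def find_nccl(candidates=NCCL_CANDIDATES, loader=ctypes.CDLL):
    """Return (name, handle) for the first candidate that loads, else (None, None).

    `loader` is a hypothetical injection point so the search order can be
    exercised without NCCL installed; by default the system dynamic loader is
    used, which honors LD_LIBRARY_PATH on Linux.
    """
    for name in candidates:
        try:
            return name, loader(name)
        except OSError:
            continue  # Not loadable under this name; try the next candidate.
    return None, None


if __name__ == "__main__":
    name, _ = find_nccl()
    print(name or "NCCL not found")
```

Running the script before deploying gives a quick indication of which library name, if any, the dynamic loader resolves in the current environment.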

Multi-device inference provides two capabilities:

- DistCollective – Distributed collective operations (AllReduce, AllGather, Broadcast, Reduce, ReduceScatter, AllToAll, Gather, Scatter) using NVIDIA NCCL. Requires Ampere (SM 80) or later.
- Multi-device attention – Attention layers with context parallelism that split the key-value sequence across GPUs. BF16 and FP16 only. Requires Blackwell (SM 100) or later.
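For readers new to collectives, the sketch below simulates what four of the listed operations compute, using plain Python lists in place of per-rank GPU buffers. It illustrates the standard semantics only; the function names are illustrative and none of this is TensorRT or NCCL API.

```python
from functools import reduce
from operator import add


def all_reduce(per_rank, op=add):
    """Every rank ends up with the elementwise reduction of all ranks' buffers."""
    reduced = [reduce(op, column) for column in zip(*per_rank)]
    return [list(reduced) for _ in per_rank]


def all_gather(per_rank):
    """Every rank ends up with all ranks' buffers concatenated in rank order."""
    gathered = [x for buf in per_rank for x in buf]
    return [list(gathered) for _ in per_rank]


def reduce_scatter(per_rank, op=add):
    """Reduce elementwise, then give rank r the r-th equal-sized chunk."""
    n = len(per_rank)
    reduced = [reduce(op, column) for column in zip(*per_rank)]
    chunk = len(reduced) // n  # assumes the buffer length is divisible by n
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(n)]


def broadcast(per_rank, root=0):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(per_rank[root]) for _ in per_rank]


if __name__ == "__main__":
    buffers = [[1, 2, 3, 4], [10, 20, 30, 40]]  # one buffer per rank
    print(all_reduce(buffers))
    print(reduce_scatter(buffers))
```

With two ranks holding `[1, 2, 3, 4]` and `[10, 20, 30, 40]`, a sum AllReduce leaves `[11, 22, 33, 44]` on both ranks, while ReduceScatter leaves `[11, 22]` on rank 0 and `[33, 44]` on rank 1.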

For operator-level API details, refer to the DistCollective and Attention sections in the TensorRT Operator documentation.

When a single-device deployment is not an option but you still need fault isolation between tenants on the same physical device, refer to Cross-Context CUDA Error Isolation.

For more information, refer to the sampleDistCollective and attention_mdtrt samples.

Setup#

Configure layers for multi-device after adding them to the network.
- For DistCollective layers, set the number of ranks:
  C++
  collectiveLayer->setNbRanks(numGpus);
  Python
  collective_layer.num_ranks = num_gpus
- For multi-device attention, set the number of ranks:
  C++
  attention->setNbRanks(numGpus);
  Python
  attention.num_ranks = num_gpus
Initialize a NCCL communicator with ncclCommInitRank or ncclCommInitAll and set it on the execution context before inference.
C++
context->setCommunicator(ncclComm);
Python
context.set_communicator(nccl_comm)
Execute inference on all ranks with synchronized enqueueV3 (C++) or execute_async_v3 (Python) calls.

Warning

All participating ranks must call the execution method concurrently because NCCL collective operations block until every rank has participated. If any rank fails to issue its execution call, the other ranks will hang indefinitely.
C++
// Each rank calls enqueueV3 on its own CUDA stream
context->enqueueV3(stream);
Python
# Each rank calls execute_async_v3 on its own CUDA stream
context.execute_async_v3(stream_ptr)
Each rank must load the same engine, allocate its own input/output buffers, and use its own IExecutionContext and CUDA stream. Use standard CUDA synchronization (cudaStreamSynchronize or CUDA events) to wait for completion on each rank.
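The requirement that every rank participates can be illustrated without GPUs. The sketch below is a simulation under stated assumptions: threads stand in for ranks, and a `threading.Barrier` stands in for a blocking collective. The `run_ranks` helper is hypothetical, and the timeout exists only so the demonstration terminates; a real NCCL collective would hang instead of timing out.

```python
import threading


def run_ranks(num_ranks, participating, timeout_s=0.5):
    """Simulate one collective step and return a per-rank status dict."""
    barrier = threading.Barrier(num_ranks)  # stand-in for a blocking collective
    status = {}

    def rank_main(rank):
        try:
            # Stand-in for enqueueV3 / execute_async_v3 plus a stream sync.
            barrier.wait(timeout=timeout_s)
            status[rank] = "done"
        except threading.BrokenBarrierError:
            # Raised here only because of the demo timeout.
            status[rank] = "stalled"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in participating]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    print(run_ranks(4, range(4)))  # all four ranks issue the call
    print(run_ranks(4, range(3)))  # rank 3 never issues its call
```

When all four simulated ranks participate, each reports `done`; when one is missing, the remaining three report `stalled`, which mirrors the hang described in the warning.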

NCCL Version Notes#

On NVIDIA B300 platforms, NCCL 2.29.4 can exhibit long (~21–22 second) cold-initialization latency on the first ncclCommInitRank call during NCCL’s “kernels init” phase. Use NCCL 2.30.x or later for multi-device workloads on B300. TensorRT requires only a minimum NCCL 2.x and supports newer minor versions, so you can upgrade NCCL independently of TensorRT.
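A deployment script can check the installed NCCL version against this recommendation. The sketch below assumes the integer version code reported by `ncclGetVersion` and the encoding of NCCL's `NCCL_VERSION` macro (major * 10000 + minor * 100 + patch from 2.9 onward, major * 1000 + minor * 100 + patch before that). The helper names are hypothetical, and treating every version below 2.30.0 as an upgrade candidate is a simplification: the text above names only 2.29.4 as affected.

```python
def decode_nccl_version(code):
    """Split an NCCL integer version code into (major, minor, patch)."""
    if code >= 10000:
        # Encoding assumed for NCCL 2.9 and later.
        return code // 10000, (code % 10000) // 100, code % 100
    # Encoding assumed for releases before 2.9.
    return code // 1000, (code % 1000) // 100, code % 100


def should_upgrade_for_b300(code, minimum=(2, 30, 0)):
    """Return True if the decoded version is older than the recommended minimum."""
    return decode_nccl_version(code) < minimum


if __name__ == "__main__":
    for code in (22904, 23001):
        print(decode_nccl_version(code), should_upgrade_for_b300(code))
```

Because NCCL can be upgraded independently of TensorRT, a check like this can run at startup and emit a warning rather than block execution.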

Refer to Prerequisites, the Support Matrix, the TensorRT 11.0.0 release notes (Known Issues), and the TensorRT 11.1.0 release notes (Fixed Issues) for full details.

Platform Support#

Architecture	OS	CUDA
x86_64	Ubuntu 24.04, Rocky 8	12.9, 13.2
AArch64	Ubuntu 22.04	13.2

Special-purpose builds (automotive, RTX, Coverity, DLA) do not support multi-device inference.
GPU requirements:
- DistCollective: Ampere (SM 80) and later
- Multi-device attention: Blackwell (SM 100) and later
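These thresholds can be encoded as a small pre-flight check. The sketch below assumes the compute capability is available as a (major, minor) pair, for example from `cudaGetDeviceProperties`, and maps it to an SM number as major * 10 + minor. The `supported_features` helper and the `MIN_SM` table are hypothetical names; only the two thresholds come from the list above.

```python
# Minimum SM versions taken from the GPU requirements listed above.
MIN_SM = {
    "DistCollective": 80,           # Ampere and later
    "Multi-device attention": 100,  # Blackwell and later
}


def supported_features(major, minor):
    """Return the multi-device features available at a compute capability."""
    sm = major * 10 + minor  # e.g. (8, 0) -> SM 80, (10, 0) -> SM 100
    return [name for name, min_sm in MIN_SM.items() if sm >= min_sm]


if __name__ == "__main__":
    for capability in ((7, 5), (8, 0), (9, 0), (10, 0)):
        print(capability, supported_features(*capability))
```

Since every rank must meet the requirement, a multi-GPU job would run this check for each participating device and take the intersection of the results.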

Feature Compatibility#

Feature	Supported	Notes
DLA	No
Ragged Tensor	No
Weight stripped engine	Yes
Refittable weights	No
Weight streaming	Yes	Rank-local: each rank streams its own sharded weights independently.
Safety build	No
Strongly typed	Yes
Precisions	Partial	Multi-device attention: BF16 and FP16 only. DistCollective: FP32, FP16, BF16, FP8, INT64, INT32, INT8, UINT8, and BOOL.
Timing cache	Yes
CUDA graphs	Yes
Quantization	Yes