Encode-Prefill-Decode (EPD) Flow with NIXL#
For high-performance multimodal inference with large embeddings, Dynamo supports a specialized Encode-Prefill-Decode (EPD) flow using NIXL (RDMA) for zero-copy tensor transfer.
Enabling the Feature#
This is an experimental feature that requires using a specific TensorRT-LLM commit.
To enable it, build the Dynamo container with the `--tensorrtllm-commit` flag followed by the commit hash:
```bash
./container/build.sh --framework trtllm --tensorrtllm-commit b4065d8ca64a64eee9fdc64b39cb66d73d4be47c
```
Key Features#
- **High Performance**: Zero-copy RDMA transfer for embeddings
- **Dynamic Shape Allocation**: Automatically handles variable embedding shapes per image
- **Multi-Format Support**: Works with tensor files (.pt) and dictionary-based embeddings
- **Hybrid Transfer**: Large tensors via NIXL, small metadata via JSON
How to use#
```bash
cd $DYNAMO_HOME/components/backends/trtllm

# Launch 3-worker EPD flow with NIXL
./launch/epd_disagg.sh
```
Configuration#
The EPD flow uses a dedicated Encode Worker that runs separately from the Prefill and Decode workers. The ENCODE_ENDPOINT environment variable specifies how the Prefill worker communicates with the Encode worker:
```bash
export ENCODE_ENDPOINT="dyn://dynamo.tensorrt_llm_encode.generate"
```
This endpoint follows Dynamo's standard format, `dyn://namespace.component.endpoint`; the Encode worker registers itself as `dynamo.tensorrt_llm_encode.generate`.
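As a sketch of how this three-part format decomposes, the helper below splits an endpoint string into its components (illustrative only, not part of Dynamo's API):

```python
# Illustrative helper (not part of Dynamo's API): split a Dynamo endpoint
# string of the form "dyn://namespace.component.endpoint" into its parts.
def parse_dyn_endpoint(endpoint: str) -> tuple[str, str, str]:
    prefix = "dyn://"
    if not endpoint.startswith(prefix):
        raise ValueError(f"not a Dynamo endpoint: {endpoint!r}")
    parts = endpoint[len(prefix):].split(".")
    if len(parts) != 3:
        raise ValueError("expected namespace.component.endpoint")
    namespace, component, name = parts
    return namespace, component, name

print(parse_dyn_endpoint("dyn://dynamo.tensorrt_llm_encode.generate"))
# → ('dynamo', 'tensorrt_llm_encode', 'generate')
```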
For local embedding file access, use the `--allowed-local-media-path "$ALLOWED_LOCAL_MEDIA_PATH"` parameter to specify the directory from which embedding files may be loaded (default: /tmp). This prevents path traversal attacks while still allowing flexible file access within the designated directory.
```bash
export ALLOWED_LOCAL_MEDIA_PATH="/tmp"
```
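A minimal sketch of the kind of containment check this parameter implies (hypothetical helper; the actual validation lives inside the worker):

```python
from pathlib import Path

ALLOWED_LOCAL_MEDIA_PATH = "/tmp"  # matches the export above

def resolve_media_path(user_path: str,
                       allowed_root: str = ALLOWED_LOCAL_MEDIA_PATH) -> Path:
    """Resolve a requested embedding path and reject anything that
    escapes the allowed directory (e.g. via '..' components)."""
    root = Path(allowed_root).resolve()
    candidate = (root / user_path).resolve()
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"{user_path!r} escapes {allowed_root!r}")
    return candidate
```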
For tensor file size protection, use the `--max-file-size-mb "$MAX_FILE_SIZE_MB"` parameter to limit the maximum size of downloaded embedding files or image URLs (default: 50 MB). This guards against denial-of-service (DoS) attacks from maliciously large files while accommodating typical embedding file sizes.
```bash
export MAX_FILE_SIZE_MB=50
```
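The size check this parameter implies can be sketched as follows (hypothetical helper shown for illustration):

```python
import os

MAX_FILE_SIZE_MB = 50  # matches the export above

def check_embedding_file_size(path: str, max_mb: int = MAX_FILE_SIZE_MB) -> int:
    """Return the file size in bytes, or raise if it exceeds the limit."""
    size = os.path.getsize(path)
    if size > max_mb * 1024 * 1024:
        raise ValueError(f"{path} is {size} bytes; limit is {max_mb} MB")
    return size
```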
Architecture Overview#
The EPD flow implements a 3-worker architecture for high-performance multimodal inference:
- **Encode Worker**: Loads and processes multimodal embeddings
- **Prefill Worker**: Handles initial context processing and KV-cache generation
- **Decode Worker**: Performs streaming token generation
Request Flow Diagrams#
Prefill-First Disaggregation Strategy#
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant PrefillWorker as "Prefill Worker<br/>(AggregatedHandler)"
    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)"
    participant NIXL as "NIXL<br/>(RDMA Transfer)"

    Note over Client,NIXL: Prefill-First Strategy: Context processing first, then streaming generation

    Client->>Gateway: POST /v1/chat/completions<br/>(multimodal request)
    Gateway->>PrefillWorker: Route request
    Note over PrefillWorker: Check for multimodal content
    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)
    Note over EncodeWorker: Load embeddings from file/URL
    EncodeWorker->>NIXL: Create readable operation
    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)
    Note over PrefillWorker: Allocate tensor with dynamic shape
    PrefillWorker->>NIXL: Begin read operation
    NIXL-->>PrefillWorker: Zero-copy transfer complete
    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
    Note over PrefillWorker: Generate KV-cache<br/>(max_tokens=1 in prefill mode)
    PrefillWorker->>DecodeWorker: Transfer KV-cache + disaggregated_params<br/>(generation_only mode)
    Note over DecodeWorker: Continue generation<br/>(streaming tokens)
    DecodeWorker->>Gateway: Stream response chunk 1
    Gateway->>Client: Response chunk 1
    DecodeWorker->>Gateway: Stream response chunk 2
    Gateway->>Client: Response chunk 2
    DecodeWorker->>Gateway: ... (continue streaming)
    Gateway->>Client: ... (continue streaming)
    DecodeWorker->>Gateway: Final response + [DONE]
    Gateway->>Client: Final response + [DONE]
```
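The small JSON metadata message in the diagram carries just enough information for the Prefill worker to size its receive buffer before the RDMA read. A hypothetical sketch (field contents are illustrative; the diagram only names shape, dtype, and aux_data):

```python
import json
import math

# Illustrative metadata message; actual field contents depend on the model.
metadata = {
    "shape": [1, 576, 4096],   # per-image embedding shape, varies per request
    "dtype": "float16",
    "aux_data": {"special_tokens": [32000], "offsets": [0]},
}
wire = json.dumps(metadata)

# The receiver can size the buffer for the zero-copy read from shape + dtype:
received = json.loads(wire)
itemsize = {"float16": 2, "float32": 4}[received["dtype"]]
nbytes = math.prod(received["shape"]) * itemsize
print(nbytes)  # → 4718592
```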
Decode-First Disaggregation Strategy#
```mermaid
sequenceDiagram
    participant Client
    participant Gateway
    participant DecodeWorker as "Decode Worker<br/>(DecodeHandler)<br/>PRIMARY"
    participant PrefillWorker as "Prefill Worker<br/>(PrefillHandler)"
    participant EncodeWorker as "Encode Worker<br/>(EncodeHandler)"
    participant NIXL as "NIXL<br/>(RDMA Transfer)"

    Note over Client,NIXL: Decode-First Strategy: DecodeWorker orchestrates prefill then handles generation

    Client->>Gateway: POST /v1/chat/completions<br/>(multimodal request)
    Gateway->>DecodeWorker: Route request<br/>(primary worker)
    Note over DecodeWorker: Check disaggregation_strategy == DECODE_FIRST
    Note over DecodeWorker: Call remote_prefill() to trigger prefill
    DecodeWorker->>PrefillWorker: Send request via remote_prefill()
    Note over PrefillWorker: Check for multimodal content
    PrefillWorker->>EncodeWorker: Send request<br/>(contains embedding paths)
    Note over EncodeWorker: Load embeddings from file
    EncodeWorker->>NIXL: Create readable operation
    EncodeWorker->>PrefillWorker: Send metadata + NIXL info<br/>(JSON: shape, dtype, aux_data)
    Note over PrefillWorker: Allocate tensor with dynamic shape
    PrefillWorker->>NIXL: Begin read operation
    NIXL-->>PrefillWorker: Zero-copy transfer complete
    Note over PrefillWorker: Reconstruct embeddings<br/>(mm_embeddings + special_tokens + offsets)
    Note over PrefillWorker: Process full context<br/>(text + multimodal embeddings)
    Note over PrefillWorker: Generate prefill response<br/>(max_tokens=1 in prefill mode)
    PrefillWorker->>DecodeWorker: Return prefill response<br/>(disaggregated_params)
    Note over DecodeWorker: Extract disaggregated_params<br/>from prefill_response
    Note over DecodeWorker: Update request with params<br/>request["disaggregated_params"] = response_data["disaggregated_params"]
    Note over DecodeWorker: Begin local generation<br/>(generate_locally with prefill state)
    DecodeWorker->>Gateway: Stream response chunk 1
    Gateway->>Client: Response chunk 1
    DecodeWorker->>Gateway: Stream response chunk 2
    Gateway->>Client: Response chunk 2
    DecodeWorker->>Gateway: ... (continue streaming)
    Gateway->>Client: ... (continue streaming)
    DecodeWorker->>Gateway: Final response + [DONE]
    Gateway->>Client: Final response + [DONE]
```
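The handoff at the end of this flow, where the Decode worker copies the prefill result into the request before generating locally, can be sketched as a small helper (hypothetical; the `ctx_id` field is purely illustrative):

```python
def apply_prefill_response(request: dict, response_data: dict) -> dict:
    """Copy the Prefill worker's disaggregated_params into the request
    so the Decode worker can continue generation locally (sketch)."""
    request["disaggregated_params"] = response_data["disaggregated_params"]
    return request

request = {"messages": [], "max_tokens": 160}
updated = apply_prefill_response(
    request, {"disaggregated_params": {"ctx_id": "abc"}}  # illustrative
)
```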
How the System Works#
1. **Request Processing**: Multimodal requests containing embedding file paths or URLs are routed according to the disaggregation strategy
2. **Multimodal Loading**: The Encode worker loads the large embedding files and extracts auxiliary metadata
3. **NIXL Transfer**: Main tensors are transferred via zero-copy RDMA; small metadata travels as JSON for efficiency
4. **Dynamic Allocation**: Consumer workers allocate tensors with the exact shapes received from the Encode worker
5. **Reconstruction**: The original embedding format (dictionary or tensor) is reconstructed for model processing
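The reconstruction step above can be sketched roughly as follows (hypothetical function; a plain list stands in for the actual tensor):

```python
def reconstruct_embeddings(buffer, aux_data=None):
    """Rebuild the original embedding container from the transferred
    buffer plus the small JSON aux_data (illustrative sketch)."""
    if aux_data is None:
        return buffer  # plain tensor (.pt) case
    # dictionary-based case: reattach special tokens, offsets, etc.
    return {"mm_embeddings": buffer, **aux_data}

# Plain-tensor and dictionary-based round trips:
plain = reconstruct_embeddings([0.1, 0.2])
rich = reconstruct_embeddings([0.1, 0.2],
                              {"special_tokens": [32000], "offsets": [0]})
```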
Example Request#
The request format is identical to regular multimodal requests:
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "meta-llama/Llama-4-Maverick-17B-128E-Instruct",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the image"},
        {
          "type": "image_url",
          "image_url": {"url": "/path/to/embeddings.pt"}
        }
      ]
    }
  ],
  "max_tokens": 160
}'
```