# Running Disaggregated Serving with Triton TensorRT-LLM Backend

## Overview
Disaggregated serving is a technique that uses separate GPUs for the context (prefill) and generation (decode) phases of LLM inference.
For the Triton integration, a BLS model named `disaggregated_serving_bls` orchestrates the disaggregated serving pipeline. This BLS model requires the names of the TRT-LLM models to be used for the context and generation phases.
This example assumes access to a system with two GPUs and `CUDA_VISIBLE_DEVICES` set to `0,1`.
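The BLS model reads the two model names from its configuration. Purely as an illustration, here is a minimal sketch of how they might appear as parameters in the `disaggregated_serving_bls` model's `config.pbtxt` (the parameter keys match the setup steps below; the surrounding fields are assumptions, not the shipped file):

```
# Illustrative fragment of disaggregated_serving_bls/config.pbtxt (not the shipped file).
name: "disaggregated_serving_bls"
backend: "python"  # the orchestration logic lives in model.py

# TRT-LLM model handling the context (prefill) phase.
parameters: {
  key: "context_model_name"
  value: { string_value: "context" }
}
# TRT-LLM model handling the generation (decode) phase.
parameters: {
  key: "generation_model_name"
  value: { string_value: "generation" }
}
```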
## Model Repository Setup and Start Server
1. Set up the model repository as instructed in the LLaMa guide.

2. Create context and generation models with the desired tensor-parallel configuration. We will use the model names `context` and `generation` for the context and generation models, respectively. Both models should be copies of the `tensorrt_llm` model configuration.

3. Set the `participant_ids` for the context and generation models to `1` and `2`, respectively.

4. Set the `gpu_device_ids` for the context and generation models to `0` and `1`, respectively. (A sketch of the resulting config fragments appears at the end of this section.)

5. Set `context_model_name` and `generation_model_name` to `context` and `generation` in the `disaggregated_serving_bls` model configuration, as sketched in the Overview above.

6. Your model repository should look like below:
```
disaggregated_serving/
|-- context
|   |-- 1
|   `-- config.pbtxt
|-- disaggregated_serving_bls
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
|-- ensemble
|   |-- 1
|   `-- config.pbtxt
|-- generation
|   |-- 1
|   `-- config.pbtxt
|-- postprocessing
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
`-- preprocessing
    |-- 1
    |   `-- model.py
    `-- config.pbtxt
```
7. Rename the `tensorrt_llm` model in the `ensemble` config.pbtxt file to `disaggregated_serving_bls`.

8. Launch the Triton Server:
```bash
python3 scripts/launch_triton_server.py --world_size 3 --tensorrt_llm_model_name context,generation --multi-model --disable-spawn-processes
```
> [!NOTE]
> The world size should be equal to `tp*pp` of the context model + `tp*pp` of the generation model + 1. The additional process is required for the orchestrator. In this example both models use `tp=1` and `pp=1`, so the world size is 1 + 1 + 1 = 3.
9. Send a request to the server:

```bash
python3 inflight_batcher_llm/client/end_to_end_grpc_client.py -S -p "Machine learning is"
```
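To make steps 3 and 4 concrete, here is a hedged sketch of the config fragments that differ between the two copies of the `tensorrt_llm` configuration. Only the `participant_ids` and `gpu_device_ids` parameters are shown; every other field stays as copied, and the exact placement of these parameters is an assumption based on the backend's parameter conventions:

```
# context/config.pbtxt (fragment): participant 1 runs on GPU 0.
parameters: {
  key: "participant_ids"
  value: { string_value: "1" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }
}

# generation/config.pbtxt (fragment): participant 2 runs on GPU 1.
parameters: {
  key: "participant_ids"
  value: { string_value: "2" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "1" }
}
```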
## Creating Multiple Copies of the Context and Generation Models (Data Parallelism)
You can also create multiple copies of the context and generation models. This is
achieved by setting the `participant_ids` and `gpu_device_ids` for each instance.
For example, if you have a context model with `tp=2` and you want to create 2
copies of it, set `participant_ids` to `1,2;3,4`, set
`gpu_device_ids` to `0,1;2,3` (assuming a 4-GPU system), and set the `count`
in the `instance_group` section of the model configuration to 2. This creates 2
copies of the context model: the first copy runs on GPUs 0 and 1, and the
second copy runs on GPUs 2 and 3, as sketched below.
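A hedged sketch of that example as a config fragment (a `tp=2` context model with two data-parallel copies on a 4-GPU system; all other fields are unchanged, and the `instance_group` layout follows the backend's usual template):

```
# context/config.pbtxt (fragment): two copies of a tp=2 model.
# Semicolons separate copies; commas separate the ranks/GPUs within one copy.
parameters: {
  key: "participant_ids"
  value: { string_value: "1,2;3,4" }
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0,1;2,3" }
}
instance_group [
  {
    count: 2        # one instance per copy of the model
    kind: KIND_CPU  # GPU placement is handled via gpu_device_ids
  }
]
```

Remember that `--world_size` must grow to match: following the note above, it is the total number of ranks across all copies of both models plus one orchestrator process (here, 4 context ranks + the generation model's ranks + 1).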
## Known Issues
Only the C++ version of the backend is currently supported.