# Running Disaggregated Serving with Triton TensorRT-LLM Backend

## Overview

Disaggregated serving is a technique that runs the context (prefill) and generation (decode) phases of LLM inference on separate GPUs.

For Triton integration, a BLS model named `disaggregated_serving_bls` orchestrates the disaggregated serving pipeline. This BLS model requires the names of the TRT-LLM models that will be used for the context and generation phases.

This example assumes access to a system with two GPUs and `CUDA_VISIBLE_DEVICES` set to `0,1`.

## Model Repository Setup and Start Server

1. Set up the model repository as instructed in the LLaMa guide.

2. Create context and generation models with the desired tensor-parallel configuration. We will use `context` and `generation` as the model names for the context and generation models, respectively. The configurations of the `context` and `generation` models should be copies of the `tensorrt_llm` model configuration.

3. Set the `participant_ids` for the context and generation models to `1` and `2`, respectively.

4. Set the `gpu_device_ids` for the context and generation models to `0` and `1`, respectively.

5. Set the `context_model_name` and `generation_model_name` parameters to `context` and `generation` in the `disaggregated_serving_bls` model configuration (see the sketch after this list).
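
For reference, below is a minimal sketch of the `config.pbtxt` entries touched by steps 3-5. It assumes the usual `parameters { key ... value { string_value ... } }` layout used by the backend's configuration templates; all other fields stay as they are in the copied `tensorrt_llm` configuration.

```
# context/config.pbtxt and generation/config.pbtxt (sketch):
# everything else is copied from the tensorrt_llm model configuration.
parameters: {
  key: "participant_ids"
  value: { string_value: "1" }    # "2" for the generation model
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0" }    # "1" for the generation model
}

# disaggregated_serving_bls/config.pbtxt (sketch):
parameters: {
  key: "context_model_name"
  value: { string_value: "context" }
}
parameters: {
  key: "generation_model_name"
  value: { string_value: "generation" }
}
```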

Your model repository should look like the following:

```
disaggregated_serving/
|-- context
|   |-- 1
|   `-- config.pbtxt
|-- disaggregated_serving_bls
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
|-- ensemble
|   |-- 1
|   `-- config.pbtxt
|-- generation
|   |-- 1
|   `-- config.pbtxt
|-- postprocessing
|   |-- 1
|   |   `-- model.py
|   `-- config.pbtxt
`-- preprocessing
    |-- 1
    |   `-- model.py
    `-- config.pbtxt
```

6. Rename the `tensorrt_llm` model in the ensemble `config.pbtxt` file to `disaggregated_serving_bls`. A sketch of the change is shown below.
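
A minimal sketch of the corresponding `ensemble/config.pbtxt` change, assuming the standard preprocessing -> TRT-LLM -> postprocessing ensemble layout shipped with the backend (only the `model_name` of the TRT-LLM step changes; the `input_map`/`output_map` entries are unchanged and elided here):

```
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      # input_map / output_map unchanged
    },
    {
      model_name: "disaggregated_serving_bls"   # previously "tensorrt_llm"
      model_version: -1
      # input_map / output_map unchanged
    },
    {
      model_name: "postprocessing"
      model_version: -1
      # input_map / output_map unchanged
    }
  ]
}
```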

7. Launch the Triton Server:

```bash
python3 scripts/launch_triton_server.py --world_size 3 --tensorrt_llm_model_name context,generation --multi-model --disable-spawn-processes
```

> [!NOTE]
> The world size should be equal to (tp * pp of the context model) + (tp * pp of the generation model) + 1. The additional process is required for the orchestrator.
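> In this example each model occupies a single GPU (tp = pp = 1), so the world size is 1 + 1 + 1 = 3, matching `--world_size 3` above.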

8. Send a request to the server:

```bash
python3 inflight_batcher_llm/client/end_to_end_grpc_client.py -S -p "Machine learning is"
```

## Creating Multiple Copies of the Context and Generation Models (Data Parallelism)

You can also create multiple copies of the context and generation models. This is achieved by setting the `participant_ids` and `gpu_device_ids` for each instance.

For example, if you have a context model with tp=2 and you want to create two copies of it, set `participant_ids` to `1,2;3,4`, set `gpu_device_ids` to `0,1;2,3` (assuming a 4-GPU system), and set `count` in the `instance_group` section of the model configuration to 2. This creates two copies of the context model: the first copy runs on GPUs 0 and 1, and the second on GPUs 2 and 3. A configuration sketch follows.
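
Below is a minimal sketch of the corresponding entries in the context model's `config.pbtxt` under these assumptions; the `kind` value and the parameter layout follow the backend's configuration template, and everything else stays as in the single-copy setup:

```
# context/config.pbtxt (sketch): two data-parallel copies with tp=2 each
instance_group [
  {
    count: 2
    kind: KIND_CPU   # as in the tensorrt_llm template; GPUs are assigned via gpu_device_ids
  }
]
parameters: {
  key: "participant_ids"
  value: { string_value: "1,2;3,4" }   # ranks of copy 1 ; ranks of copy 2
}
parameters: {
  key: "gpu_device_ids"
  value: { string_value: "0,1;2,3" }   # GPUs of copy 1 ; GPUs of copy 2
}
```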

## Known Issues

1. Only the C++ version of the backend is supported right now.