End-to-end workflow for the PyTorch LLMAPI backend

Start the Triton Server Docker container:

# Replace <yy.mm> with the version of Triton you want to use.
# The command below assumes that the current directory is the
# TRT-LLM backend root git repository.
# The Hugging Face cache is mounted at /root/.cache/huggingface, which
# assumes the container runs as root (Docker requires an absolute container path).

docker run --rm -ti -v `pwd`:/mnt -w /mnt -v ~/.cache/huggingface:/root/.cache/huggingface --gpus all nvcr.io/nvidia/tritonserver:<yy.mm>-trtllm-python-py3 bash

Prepare config

cp -R tensorrt_llm/triton_backend/all_models/llmapi/ llmapi_repo/

Edit llmapi_repo/tensorrt_llm/1/model.yaml to change the model. You can use either a Hugging Face model ID or a local path. The examples below use meta-llama/Llama-3.1-8B.

The same configuration file also lets you enable CUDA graph support and set the pipeline-parallel and tensor-parallel sizes.

Launch server

python3 tensorrt_llm/triton_backend/scripts/launch_triton_server.py --model_repo=llmapi_repo/

Send request

curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "The future of AI is", "sampling_param_max_tokens":10}' | jq

Optional: include performance metrics

To retrieve detailed per-request performance metrics, such as KV cache usage, timing breakdowns, and speculative decoding statistics, add "sampling_param_return_perf_metrics": true to the request payload:

curl -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"text_input": "Please explain to me what is machine learning?", "sampling_param_max_tokens":10, "sampling_param_return_perf_metrics":true}' | jq
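The two curl requests above can also be issued from Python. The following is a minimal sketch using only the standard library. The helper names build_payload and generate are hypothetical, and the URL assumes the server listens on localhost:8000 as in the curl commands.

```python
import json
import urllib.request

# Assumption: Triton's HTTP endpoint is reachable on localhost:8000,
# exactly as in the curl commands above.
GENERATE_URL = "http://localhost:8000/v2/models/tensorrt_llm/generate"


def build_payload(prompt, max_tokens=10, return_perf_metrics=False):
    """Build the JSON body used by the curl examples (hypothetical helper)."""
    payload = {
        "text_input": prompt,
        "sampling_param_max_tokens": max_tokens,
    }
    # Only add the metrics flag when requested, mirroring the second curl call.
    if return_perf_metrics:
        payload["sampling_param_return_perf_metrics"] = True
    return payload


def generate(prompt, url=GENERATE_URL, timeout=60.0, **kwargs):
    """POST a generate request and return the decoded JSON response."""
    body = json.dumps(build_payload(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running server):
# result = generate("The future of AI is", return_perf_metrics=True)
# print(result["text_output"])
```

Keeping the payload construction separate from the network call makes it easy to inspect or log the request body without a running server.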

Sample response with performance metrics

{
  "acceptance_rate": "0.0",
  "arrival_time_ns": "76735247746000",
  "first_scheduled_time_ns": "76735248284000",
  "first_token_time_ns": "76735374300000",
  "kv_cache_alloc_new_blocks": "1",
  "kv_cache_alloc_total_blocks": "1",
  "kv_cache_hit_rate": "0.0",
  "kv_cache_missed_block": "1",
  "kv_cache_reused_block": "0",
  "last_token_time_ns": "76736545324000",
  "model_name": "tensorrt_llm",
  "model_version": "1",
  "text_output": "Please explain to me what is machine learning? \n\nMachine learning is a field of computer science that involves the development of algorithms and models that can learn from data without being explicitly programmed. It is a",
  "total_accepted_draft_tokens": "0",
  "total_draft_tokens": "0"
}
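The timestamp fields in the sample response are returned as strings of nanoseconds, so they need converting before they are useful. The sketch below derives a few intervals from them. The function name summarize_timings is hypothetical, and the interpretation of each field (arrival, first scheduling, first token, last token) is inferred from the field names rather than from a specification.

```python
def summarize_timings(response):
    """Convert the *_time_ns string fields into millisecond intervals."""
    arrival = int(response["arrival_time_ns"])
    scheduled = int(response["first_scheduled_time_ns"])
    first_token = int(response["first_token_time_ns"])
    last_token = int(response["last_token_time_ns"])
    ns_per_ms = 1_000_000
    return {
        # Assumed meaning: time spent waiting before the request was first scheduled.
        "queue_ms": (scheduled - arrival) / ns_per_ms,
        # Assumed meaning: time from arrival until the first generated token.
        "time_to_first_token_ms": (first_token - arrival) / ns_per_ms,
        # Assumed meaning: time from arrival until the last generated token.
        "total_ms": (last_token - arrival) / ns_per_ms,
    }


# Timestamp values copied from the sample response above.
sample = {
    "arrival_time_ns": "76735247746000",
    "first_scheduled_time_ns": "76735248284000",
    "first_token_time_ns": "76735374300000",
    "last_token_time_ns": "76736545324000",
}
print(summarize_timings(sample))
```

For the sample response this gives a queue time of 0.538 ms, a time to first token of 126.554 ms, and a total of 1297.578 ms.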

inflight_batcher_llm_client.py is not supported yet.

Run test on dataset

python3 tensorrt_llm/triton_backend/tools/inflight_batcher_llm/end_to_end_test.py --dataset tensorrt_llm/triton_backend/ci/L0_backend_trtllm/simple_data.json --max-input-len 500 --test-llmapi --model-name tensorrt_llm

[INFO] Start testing on 13 prompts.
[INFO] Functionality test succeeded.
[INFO] Warm up for benchmarking.
FLAGS.model_name: tensorrt_llm
[INFO] Start benchmarking on 13 prompts.
[INFO] Total Latency: 377.254 ms
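To compare runs, it can help to turn this output into a single throughput figure. The sketch below parses the two relevant log lines. The function name summarize_benchmark is hypothetical, and it assumes that "Total Latency" is the elapsed time for the whole benchmarked batch of prompts.

```python
import re


def summarize_benchmark(log_text):
    """Extract prompt count and total latency from the test script's output."""
    prompts = re.search(r"Start benchmarking on (\d+) prompts", log_text)
    latency = re.search(r"Total Latency: ([\d.]+) ms", log_text)
    if prompts is None or latency is None:
        raise ValueError("log does not contain the expected benchmark lines")
    num_prompts = int(prompts.group(1))
    total_ms = float(latency.group(1))
    return {
        "num_prompts": num_prompts,
        "total_latency_ms": total_ms,
        # Assumes total latency covers all prompts in the benchmark run.
        "prompts_per_second": num_prompts / (total_ms / 1000.0),
    }


# Log lines copied from the output above.
log = """[INFO] Start testing on 13 prompts.
[INFO] Functionality test succeeded.
[INFO] Warm up for benchmarking.
FLAGS.model_name: tensorrt_llm
[INFO] Start benchmarking on 13 prompts.
[INFO] Total Latency: 377.254 ms
"""
print(summarize_benchmark(log))
```

For the output shown here, this works out to roughly 34.5 prompts per second.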

Run benchmark

python3 tensorrt_llm/triton_backend/tools/inflight_batcher_llm/benchmark_core_model.py --max-input-len 500 \
    --tensorrt-llm-model-name tensorrt_llm \
    --test-llmapi \
    dataset --dataset ./tensorrt_llm/triton_backend/tools/dataset/mini_cnn_eval.json \
    --tokenizer-dir meta-llama/Llama-3.1-8B

dataset
Tokenizer: Tokens per word =  1.308
[INFO] Warm up for benchmarking.
[INFO] Start benchmarking on 39 prompts.
[INFO] Total Latency: 1446.623 ms

Start the server on a multi-node configuration

The srun tool can be used to start the server in a multi-node environment:

srun -N 2 \
    --ntasks-per-node=8 \
    --mpi=pmix \
    --container-image=<your image> \
    --container-mounts=$(pwd)/tensorrt_llm/:/code \
    trtllm-llmapi-launch /opt/tritonserver/bin/tritonserver --model-repository llmapi_repo

Running Multiple Model Instances

You can run multiple instances of a model by combining Triton’s instance_group configuration with the gpu_device_ids parameter. Triton core then load-balances requests across the instances.

The gpu_device_ids parameter uses semicolons to separate GPU assignments for each instance, and commas to separate GPUs within a single instance (for tensor parallelism). The number of GPU device ID sets (separated by semicolons) must equal the instance count.
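The format rule can be checked before launching the server. The following is a small illustrative parser, not the backend's own implementation. The function name parse_gpu_device_ids is hypothetical, and instance_count stands for the count value set in instance_group.

```python
def parse_gpu_device_ids(value, instance_count):
    """Split a gpu_device_ids string into one list of GPU IDs per instance.

    Semicolons separate instances; commas separate GPUs within an instance.
    """
    groups = []
    for group in value.split(";"):
        # int() tolerates surrounding whitespace and rejects empty entries.
        groups.append([int(part) for part in group.split(",")])
    if len(groups) != instance_count:
        raise ValueError(
            f"expected {instance_count} GPU ID sets, got {len(groups)}"
        )
    return groups


print(parse_gpu_device_ids("0,1;2,3", 2))  # two instances, two GPUs each
print(parse_gpu_device_ids("0;0", 2))      # two instances sharing GPU 0
```

A mismatch between the number of sets and the instance count, such as parse_gpu_device_ids("0,1;2,3", 3), raises a ValueError in this sketch.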

Example: 2 Instances with TP=2 on 4 GPUs

To run 2 instances of a tensor-parallel model (TP=2) across 4 GPUs, where instance 1 uses GPUs 0 and 1 and instance 2 uses GPUs 2 and 3:

Configure model.yaml with tensor parallelism:

model: meta-llama/Llama-3.1-8B
tensor_parallel_size: 2

Update config.pbtxt to specify instance groups and GPU assignments:

instance_group [
    {kind: KIND_CPU, count: 2}
]

parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0,1;2,3"
  }
}

Launch the server with --no-mpi:

python3 tensorrt_llm/triton_backend/scripts/launch_triton_server.py --model_repo=llmapi_repo/ --no-mpi

The --no-mpi flag is required for multi-instance deployments.

Example: Multiple Instances on the Same GPU

You can also run multiple instances on the same GPU for smaller models. Use kv_cache_config.free_gpu_memory_fraction in model.yaml to limit KV cache memory per instance so all instances fit in GPU memory.

Configure model.yaml with a reduced KV cache fraction:

model: meta-llama/Llama-3.1-8B
kv_cache_config:
  free_gpu_memory_fraction: 0.3

Update config.pbtxt to run 2 instances on GPU 0:

instance_group [
    {kind: KIND_CPU, count: 2}
]

parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0;0"
  }
}

Launch the server:

python3 tensorrt_llm/triton_backend/scripts/launch_triton_server.py --model_repo=llmapi_repo/ --no-mpi