Model Configuration#

Model Parameters#

The following tables show the parameters in the config.pbtxt files of the models in all_models/inflight_batcher_llm that can be modified before deployment. For optimal performance or custom parameters, please refer to perf_best_practices.

The names of the parameters listed below are the values in the config.pbtxt that can be modified using the fill_template.py script.

NOTE: For fields whose value contains a comma (e.g. gpu_device_ids, participant_ids), you need to escape the comma with a backslash. For example, to set gpu_device_ids to 0,1 you need to run python3 fill_template.py -i config.pbtxt "gpu_device_ids:0\,1".

The mandatory parameters must be set for the model to run. The optional parameters are not required but can be set to customize the model.
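
For example, several of the parameters described below can be filled in one pass with fill_template.py. The repository layout, model directory, and values below are illustrative placeholders, not recommendations:

```bash
# Illustrative only: adjust paths and values to your deployment.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:true,engine_dir:/models/llama/engine,batching_strategy:inflight_fused_batching,logits_datatype:TYPE_FP32"
```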

ensemble model#

See here to learn more about ensemble models.

Mandatory parameters

| Name | Description |
| :--- | :--- |
| `triton_max_batch_size` | The maximum batch size that the Triton model instance will run with. Note that for the tensorrt_llm model, the actual runtime batch size can be larger than `triton_max_batch_size`. The runtime batch size is determined by the TRT-LLM scheduler based on a number of parameters, such as the number of available requests in the queue and the engine build trtllm-build parameters (such as `max_num_tokens` and `max_batch_size`). |
| `logits_datatype` | The data type for context and generation logits. |

preprocessing model#

Mandatory parameters

| Name | Description |
| :--- | :--- |
| `triton_max_batch_size` | The maximum batch size that Triton should use with the model. |
| `tokenizer_dir` | The path to the tokenizer for the model. |
| `preprocessing_instance_count` | The number of instances of the model to run. |
| `max_queue_delay_microseconds` | The maximum queue delay in microseconds. Setting this parameter to a value greater than 0 can improve the chances that two requests arriving within `max_queue_delay_microseconds` will be scheduled in the same TRT-LLM iteration. |
| `max_queue_size` | The maximum number of requests allowed in the TRT-LLM queue before rejecting new requests. |

Optional parameters

| Name | Description |
| :--- | :--- |
| `add_special_tokens` | The `add_special_tokens` flag used by HF tokenizers. |
| `multimodal_model_path` | The vision engine path used in the multimodal workflow. |
| `engine_dir` | The path to the engine for the model. This parameter is only needed for multimodal processing, to extract the `vocab_size` from the engine_dir's config.json for `fake_prompt_id` mappings. |

multimodal_encoders model#

Mandatory parameters

| Name | Description |
| :--- | :--- |
| `triton_max_batch_size` | The maximum batch size that Triton should use with the model. |
| `max_queue_delay_microseconds` | The maximum queue delay in microseconds. Setting this parameter to a value greater than 0 can improve the chances that two requests arriving within `max_queue_delay_microseconds` will be scheduled in the same TRT-LLM iteration. |
| `max_queue_size` | The maximum number of requests allowed in the TRT-LLM queue before rejecting new requests. |
| `multimodal_model_path` | The vision engine path used in the multimodal workflow. |
| `hf_model_path` | The Huggingface model path used for the llava_onevision and mllama models. |

postprocessing model#

Mandatory parameters

| Name | Description |
| :--- | :--- |
| `triton_max_batch_size` | The maximum batch size that Triton should use with the model. |
| `tokenizer_dir` | The path to the tokenizer for the model. |
| `postprocessing_instance_count` | The number of instances of the model to run. |

Optional parameters

| Name | Description |
| :--- | :--- |
| `skip_special_tokens` | The `skip_special_tokens` flag used by HF detokenizers. |

tensorrt_llm model#

The majority of the tensorrt_llm model parameters and input/output tensors can be mapped to parameters in the TRT-LLM C++ runtime API defined in executor.h. Please refer to the Doxygen comments in executor.h for a more detailed description of the parameters below.

Mandatory parameters

| Name | Description |
| :--- | :--- |
| `triton_backend` | The backend to use for the model. Set to tensorrtllm to utilize the C++ TRT-LLM backend implementation. Set to python to utilize the TRT-LLM Python runtime. |
| `triton_max_batch_size` | The maximum batch size that the Triton model instance will run with. Note that for the tensorrt_llm model, the actual runtime batch size can be larger than `triton_max_batch_size`. The runtime batch size is determined by the TRT-LLM scheduler based on a number of parameters, such as the number of available requests in the queue and the engine build trtllm-build parameters (such as `max_num_tokens` and `max_batch_size`). |
| `decoupled_mode` | Whether to use decoupled mode. Must be set to true for requests setting the `stream` tensor to true. |
| `max_queue_delay_microseconds` | The maximum queue delay in microseconds. Setting this parameter to a value greater than 0 can improve the chances that two requests arriving within `max_queue_delay_microseconds` will be scheduled in the same TRT-LLM iteration. |
| `max_queue_size` | The maximum number of requests allowed in the TRT-LLM queue before rejecting new requests. |
| `engine_dir` | The path to the engine for the model. |
| `batching_strategy` | The batching strategy to use. Set to inflight_fused_batching to enable in-flight batching support. To disable in-flight batching, set to V1. |
| `encoder_input_features_data_type` | The dtype for the input tensor `encoder_input_features`. For the mllama model, this must be TYPE_BF16. For other models like whisper, this is TYPE_FP16. |
| `logits_datatype` | The data type for context and generation logits. |

Optional parameters

  • General

| Name | Description |
| :--- | :--- |
| `encoder_engine_dir` | When running encoder-decoder models, the path to the folder that contains the model configuration and engine for the encoder model. |
| `max_attention_window_size` | When using techniques like sliding window attention, the maximum number of tokens that are attended to in order to generate one token. By default, all tokens in the sequence are attended to. (default=max_sequence_length) |
| `sink_token_length` | Number of sink tokens to always keep in the attention window. |
| `exclude_input_in_output` | Set to true to only return completion tokens in a response. Set to false to return the prompt tokens concatenated with the generated tokens. (default=false) |
| `cancellation_check_period_ms` | The time for the cancellation check thread to sleep before doing the next check. It checks whether any of the currently active requests have been cancelled through Triton and prevents further execution of them. (default=100) |
| `stats_check_period_ms` | The time for the statistics reporting thread to sleep before doing the next check. (default=100) |
| `recv_poll_period_ms` | The time for the receiving thread in orchestrator mode to sleep before doing the next check. (default=0) |
| `iter_stats_max_iterations` | The maximum number of iterations for which to keep statistics. (default=ExecutorConfig::kDefaultIterStatsMaxIterations) |
| `request_stats_max_iterations` | The maximum number of iterations for which to keep per-request statistics. (default=executor::kDefaultRequestStatsMaxIterations) |
| `normalize_log_probs` | Controls whether log probabilities should be normalized. Set to false to skip normalization of `output_log_probs`. (default=true) |
| `gpu_device_ids` | Comma-separated list of GPU IDs to use for this model. Use semicolons to separate multiple instances of the model. If not provided, the model will use all visible GPUs. (default=unspecified) See the example after this table. |
| `participant_ids` | Comma-separated list of MPI ranks to use for this model. Mandatory when using orchestrator mode with -disable-spawn-process. (default=unspecified) |
| `num_nodes` | Number of MPI nodes to use for this model. (default=1) |
| `gpu_weights_percent` | Set to a number between 0.0 and 1.0 to specify the fraction of weights that resides on GPU instead of CPU; the remainder is streamed from CPU during runtime. Values less than 1.0 are only supported for an engine built with weight_streaming on. (default=1.0) |
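
For instance, assuming two instances of the model pinned to GPUs 0,1 and 2,3 respectively (device IDs are hypothetical), gpu_device_ids could be filled like this, with the commas escaped as described in the note at the top of this page:

```bash
# Illustrative only: two model instances, one on GPUs 0,1 and one on GPUs 2,3.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "gpu_device_ids:0\,1;2\,3"
```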

  • KV cache

Note that the parameter enable_trt_overlap has been removed from the config.pbtxt. This option allowed the execution of two micro-batches to be overlapped to hide CPU overhead. Optimization work has since reduced the CPU overhead, and it was found that overlapping micro-batches did not provide additional benefits.

| Name | Description |
| :--- | :--- |
| `max_tokens_in_paged_kv_cache` | The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of `max_tokens_in_paged_kv_cache` and the value derived from `kv_cache_free_gpu_mem_fraction` below. (default=unspecified) |
| `kv_cache_free_gpu_mem_fraction` | Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache. (default=0.9) |
| `cross_kv_cache_fraction` | Set to a number between 0 and 1 to indicate the maximum fraction of KV cache that may be used for cross attention; the rest will be used for self attention. Optional parameter that should be set for encoder-decoder models ONLY. (default=0.5) |
| `kv_cache_host_memory_bytes` | Enable offloading to host memory for the given byte size. |
| `enable_kv_cache_reuse` | Set to true to reuse previously computed KV cache values (e.g. for the system prompt). |
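
For example, the KV cache parameters above can be set with fill_template.py; the fraction and byte size below are placeholder values, not tuned recommendations:

```bash
# Illustrative only: tune for your GPU and host memory budget.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "kv_cache_free_gpu_mem_fraction:0.85,enable_kv_cache_reuse:true,kv_cache_host_memory_bytes:4294967296"
```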

  • LoRA cache

| Name | Description |
| :--- | :--- |
| `lora_cache_optimal_adapter_size` | Optimal adapter size used to size cache pages. Typically, optimally sized adapters will fit exactly into 1 cache page. (default=8) |
| `lora_cache_max_adapter_size` | Used to set the minimum size of a cache page. Pages must be at least large enough to fit a single module, single layer row of weights for an adapter of size `lora_cache_max_adapter_size`. (default=64) |
| `lora_cache_gpu_memory_fraction` | Fraction of GPU memory used for the LoRA cache. Computed as a fraction of the memory left over after the engine and the KV cache are loaded. (default=0.05) |
| `lora_cache_host_memory_bytes` | Size of the host LoRA cache in bytes. (default=1G) |
| `lora_prefetch_dir` | Folder of LoRA weights to prefetch and load during engine initialization. |

  • Decoding mode

| Name | Description |
| :--- | :--- |
| `max_beam_width` | The beam width value of requests that will be sent to the executor. (default=1) |
| `decoding_mode` | Set to one of {top_k, top_p, top_k_top_p, beam_search, medusa, redrafter, lookahead, eagle} to select the decoding mode. The top_k mode uses only the Top-K algorithm for sampling, and the top_p mode uses only the Top-P algorithm. The top_k_top_p mode employs both Top-K and Top-P algorithms, depending on the runtime sampling params of the request; note that top_k_top_p requires more memory and has a longer runtime than top_k or top_p individually, so it should be used only when necessary. beam_search uses the beam search algorithm. If not specified, the default is top_k_top_p when max_beam_width == 1; otherwise, beam_search is used. When a Medusa model is used, the medusa decoding mode should be set; however, TensorRT-LLM detects a loaded Medusa model and overwrites the decoding mode to medusa with a warning. The same applies to ReDrafter, Lookahead, and Eagle. |
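
For instance, selecting Top-K/Top-P sampling with a single beam (just one of the combinations described above) could look like:

```bash
# Illustrative only: decoding mode and beam width for a sampling deployment.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "decoding_mode:top_k_top_p,max_beam_width:1"
```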

  • Optimization

| Name | Description |
| :--- | :--- |
| `enable_chunked_context` | Set to true to enable context chunking. (default=false) |
| `multi_block_mode` | Set to false to disable multi block mode. (default=true) |
| `enable_context_fmha_fp32_acc` | Set to true to enable FMHA runner FP32 accumulation. (default=false) |
| `cuda_graph_mode` | Set to true to enable CUDA graphs. (default=false) |
| `cuda_graph_cache_size` | Sets the size of the CUDA graph cache, in number of CUDA graphs. (default=0) |

  • Scheduling

| Name | Description |
| :--- | :--- |
| `batch_scheduler_policy` | Set to max_utilization to greedily pack as many requests as possible in each in-flight batching iteration. This maximizes throughput but may result in overheads due to request pause/resume if KV cache limits are reached during execution. Set to guaranteed_no_evict to guarantee that a started request is never paused. (default=guaranteed_no_evict) |

  • Medusa

| Name | Description |
| :--- | :--- |
| `medusa_choices` | Specifies the Medusa choices tree in the format of e.g. "{0, 0, 0}, {0, 1}". By default, the mc_sim_7b_63 choices are used. |

  • Eagle

| Name | Description |
| :--- | :--- |
| `eagle_choices` | Specifies the default, per-server Eagle choices tree in the format of e.g. "{0, 0, 0}, {0, 1}". By default, the mc_sim_7b_63 choices are used. |

  • Guided decoding

| Name | Description |
| :--- | :--- |
| `guided_decoding_backend` | Set to xgrammar to activate the guided decoder. |
| `tokenizer_dir` | The tokenizer information required for guided decoding with the tensorrt_llm Python backend. |
| `xgrammar_tokenizer_info_path` | The xgrammar tokenizer info, in JSON format, required for guided decoding with the tensorrt_llm C++ backend. |
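
As a sketch, enabling guided decoding for the C++ backend might look like the following; the JSON path is a placeholder for tokenizer info you have generated for your model:

```bash
# Illustrative only: paths are placeholders.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt \
    "guided_decoding_backend:xgrammar,xgrammar_tokenizer_info_path:/models/llama/xgrammar_tokenizer_info.json"
```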

tensorrt_llm_bls model#

See here to learn more about BLS models.

Mandatory parameters

| Name | Description |
| :--- | :--- |
| `triton_max_batch_size` | The maximum batch size that the model can handle. |
| `decoupled_mode` | Whether to use decoupled mode. |
| `bls_instance_count` | The number of instances of the model to run. When using the BLS model instead of the ensemble, you should set the number of model instances to the maximum batch size supported by the TRT engine to allow concurrent request execution. |
| `logits_datatype` | The data type for context and generation logits. |

Optional parameters

  • General

| Name | Description |
| :--- | :--- |
| `accumulate_tokens` | Used in streaming mode to call the postprocessing model with all accumulated tokens instead of only one token. This might be necessary for certain tokenizers. |

  • Speculative decoding

The BLS model supports speculative decoding. Target and draft Triton models are set with the parameters tensorrt_llm_model_name and tensorrt_llm_draft_model_name. Speculative decoding is performed by setting num_draft_tokens in the request. use_draft_logits may be set to use logits-comparison speculative decoding. Note that return_generation_logits and return_context_logits are not supported when using speculative decoding. Also note that requests with a batch size greater than 1 are not currently supported with speculative decoding. See the example after the table below.

| Name | Description |
| :--- | :--- |
| `tensorrt_llm_model_name` | The name of the TensorRT-LLM model to use. |
| `tensorrt_llm_draft_model_name` | The name of the TensorRT-LLM draft model to use. |
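
For example, assuming the target and draft engines have been added to the model repository as tensorrt_llm and tensorrt_llm_draft (illustrative names), the BLS model could be pointed at them like this:

```bash
# Illustrative only: model names must match the directories in your model repository.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    "tensorrt_llm_model_name:tensorrt_llm,tensorrt_llm_draft_model_name:tensorrt_llm_draft"
```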

Model Input and Output#

Below are the lists of input and output tensors for the tensorrt_llm and tensorrt_llm_bls models.

Common Inputs#

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `end_id` | [1] | int32 | End token ID. If not specified, defaults to -1 |
| `pad_id` | [1] | int32 | Padding token ID |
| `temperature` | [1] | float32 | Sampling Config param: temperature |
| `repetition_penalty` | [1] | float | Sampling Config param: repetitionPenalty |
| `min_tokens` | [1] | int32_t | Sampling Config param: minTokens |
| `presence_penalty` | [1] | float | Sampling Config param: presencePenalty |
| `frequency_penalty` | [1] | float | Sampling Config param: frequencyPenalty |
| `seed` | [1] | uint64_t | Sampling Config param: seed |
| `return_log_probs` | [1] | bool | When true, include log probs in the output |
| `return_context_logits` | [1] | bool | When true, include context logits in the output |
| `return_generation_logits` | [1] | bool | When true, include generation logits in the output |
| `num_return_sequences` | [1] | int32_t | Number of generated sequences per request. (Default=1) |
| `beam_width` | [1] | int32_t | Beam width for this request; set to 1 for greedy sampling (Default=1) |
| `prompt_embedding_table` | [1] | float16 (model data type) | P-tuning prompt embedding table |
| `prompt_vocab_size` | [1] | int32 | P-tuning prompt vocab size |
| `return_perf_metrics` | [1] | bool | When true, include perf metrics in the output, such as KV cache reuse stats |
| `guided_decoding_guide_type` | [1] | string | Guided decoding param: guide_type |
| `guided_decoding_guide` | [1] | string | Guided decoding param: guide |
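
As a rough illustration, many of these per-request tensors can be supplied as JSON fields through Triton's HTTP generate endpoint on a deployed ensemble or BLS model. The model name, address, and exact field set below are assumptions about a particular deployment, not a fixed contract:

```bash
# Illustrative only: assumes an ensemble model that exposes text_input/max_tokens
# along with the optional sampling tensors listed above.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "What is machine learning?",
  "max_tokens": 64,
  "temperature": 0.7,
  "beam_width": 1,
  "return_log_probs": true
}'
```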

The following LoRA inputs apply to both the tensorrt_llm and tensorrt_llm_bls models. The inputs are passed through to the tensorrt_llm model, and the tensorrt_llm_bls model refers to the inputs of the tensorrt_llm model.

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `lora_task_id` | [1] | uint64 | The unique task ID for the given LoRA. To perform inference with a specific LoRA for the first time, `lora_task_id`, `lora_weights`, and `lora_config` must all be given. The LoRA will be cached, so that subsequent requests for the same task only require `lora_task_id`. If the cache is full, the oldest LoRA will be evicted to make space for new ones. An error is returned if `lora_task_id` is not cached. |
| `lora_weights` | [ num_lora_modules_layers, D x Hi + Ho x D ] | float (model data type) | Weights for a LoRA adapter. See the config file for more details. |
| `lora_config` | [ num_lora_modules_layers, 3 ] | int32 | Module identifier. See the config file for more details. |

Common Outputs#

Note: the timing metrics outputs are represented as the number of nanoseconds since the epoch.

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `cum_log_probs` | [-1] | float | Cumulative log probabilities for each output |
| `output_log_probs` | [beam_width, -1] | float | Log probabilities for each output |
| `context_logits` | [-1, vocab_size] | float | Context logits for the input |
| `generation_logits` | [beam_width, seq_len, vocab_size] | float | Generation logits for each output |
| `batch_index` | [1] | int32 | Batch index |
| `kv_cache_alloc_new_blocks` | [1] | int32 | KV cache reuse metric: number of newly allocated blocks per request. Set the optional input `return_perf_metrics` to true to include `kv_cache_alloc_new_blocks` in the outputs. |
| `kv_cache_reused_blocks` | [1] | int32 | KV cache reuse metric: number of reused blocks per request. Set the optional input `return_perf_metrics` to true to include `kv_cache_reused_blocks` in the outputs. |
| `kv_cache_alloc_total_blocks` | [1] | int32 | KV cache reuse metric: total number of allocated blocks per request. Set the optional input `return_perf_metrics` to true to include `kv_cache_alloc_total_blocks` in the outputs. |
| `arrival_time_ns` | [1] | float | Time when the request was received by TRT-LLM. Set the optional input `return_perf_metrics` to true to include `arrival_time_ns` in the outputs. |
| `first_scheduled_time_ns` | [1] | float | Time when the request was first scheduled. Set the optional input `return_perf_metrics` to true to include `first_scheduled_time_ns` in the outputs. |
| `first_token_time_ns` | [1] | float | Time when the first token was generated. Set the optional input `return_perf_metrics` to true to include `first_token_time_ns` in the outputs. |
| `last_token_time_ns` | [1] | float | Time when the last token was generated. Set the optional input `return_perf_metrics` to true to include `last_token_time_ns` in the outputs. |
| `acceptance_rate` | [1] | float | Acceptance rate of the speculative decoding model. Set the optional input `return_perf_metrics` to true to include `acceptance_rate` in the outputs. |
| `total_accepted_draft_tokens` | [1] | int32 | Number of tokens accepted by the target model in speculative decoding. Set the optional input `return_perf_metrics` to true to include `total_accepted_draft_tokens` in the outputs. |
| `total_draft_tokens` | [1] | int32 | Maximum number of draft tokens acceptable by the target model in speculative decoding. Set the optional input `return_perf_metrics` to true to include `total_draft_tokens` in the outputs. |

Unique Inputs for tensorrt_llm model#

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `input_ids` | [-1] | int32 | Input token IDs |
| `input_lengths` | [1] | int32 | Input lengths |
| `request_output_len` | [1] | int32 | Requested output length |
| `draft_input_ids` | [-1] | int32 | Draft input IDs |
| `decoder_input_ids` | [-1] | int32 | Decoder input IDs |
| `decoder_input_lengths` | [1] | int32 | Decoder input lengths |
| `draft_logits` | [-1, -1] | float32 | Draft logits |
| `draft_acceptance_threshold` | [1] | float32 | Draft acceptance threshold |
| `stop_words_list` | [2, -1] | int32 | List of stop words |
| `bad_words_list` | [2, -1] | int32 | List of bad words |
| `embedding_bias` | [-1] | string | Embedding bias words |
| `runtime_top_k` | [1] | int32 | Top-k value for runtime top-k sampling |
| `runtime_top_p` | [1] | float32 | Top-p value for runtime top-p sampling |
| `runtime_top_p_min` | [1] | float32 | Minimum value for runtime top-p sampling |
| `runtime_top_p_decay` | [1] | float32 | Decay value for runtime top-p sampling |
| `runtime_top_p_reset_ids` | [1] | int32 | Reset IDs for runtime top-p sampling |
| `len_penalty` | [1] | float32 | Controls how to penalize longer sequences in beam search (Default=0.f) |
| `early_stopping` | [1] | bool | Enable early stopping |
| `beam_search_diversity_rate` | [1] | float32 | Beam search diversity rate |
| `stop` | [1] | bool | Stop flag |
| `streaming` | [1] | bool | Enable streaming |

Unique Outputs for tensorrt_llm model#

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `output_ids` | [-1, -1] | int32 | Output token IDs |
| `sequence_length` | [-1] | int32 | Sequence length |

Unique Inputs for tensorrt_llm_bls model#

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `text_input` | [-1] | string | Prompt text |
| `decoder_text_input` | [1] | string | Decoder input text |
| `image_input` | [3, 224, 224] | float16 | Input image |
| `max_tokens` | [-1] | int32 | Number of tokens to generate |
| `bad_words` | [2, num_bad_words] | int32 | Bad words list |
| `stop_words` | [2, num_stop_words] | int32 | Stop words list |
| `top_k` | [1] | int32 | Sampling Config param: topK |
| `top_p` | [1] | float32 | Sampling Config param: topP |
| `length_penalty` | [1] | float32 | Sampling Config param: lengthPenalty |
| `stream` | [1] | bool | When true, stream out tokens as they are generated. When false, return only when the full generation has completed (Default=false) |
| `embedding_bias_words` | [-1] | string | Embedding bias words |
| `embedding_bias_weights` | [-1] | float32 | Embedding bias weights |
| `num_draft_tokens` | [1] | int32 | Number of tokens to get from the draft model during speculative decoding |
| `use_draft_logits` | [1] | bool | Use logits comparison during speculative decoding |

Unique Outputs for tensorrt_llm_bls model#

| Name | Shape | Type | Description |
| :--- | :--- | :--- | :--- |
| `text_output` | [-1] | string | Text output |

Some tips for model configuration#

Below are some tips for configuring models for optimal performance. These recommendations are based on our experiments and may not apply to all use cases. For guidance on other parameters, please refer to the perf_best_practices.

  • Setting the instance_count for models to better utilize in-flight batching

    The instance_count parameter in the config.pbtxt file specifies the number of instances of the model to run. Ideally, this should be set to match the maximum batch size supported by the TRT engine, as this allows for concurrent request execution and reduces performance bottlenecks. However, a larger value also consumes more CPU memory. While the optimal value cannot be determined in advance, it generally should not be set to a very small value such as 1. For most use cases, we have found that setting instance_count to 5 works well across a variety of workloads in our experiments.

  • Adjusting max_batch_size and max_num_tokens to optimize in-flight batching

    max_batch_size and max_num_tokens are important parameters for optimizing in-flight batching. You can modify max_batch_size in the model configuration file, while max_num_tokens is set during conversion to a TRT-LLM engine with the trtllm-build command. Tuning these parameters is necessary for different scenarios, and experimentation is currently the best approach to finding optimal values. Generally, the total number of requests should be lower than max_batch_size, and the total tokens should be less than max_num_tokens. See the sketch at the end of this section.
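
For reference, a sketch combining the two tips above; the paths, batch size, token budget, and instance count are placeholders taken from the discussion, not tuned recommendations:

```bash
# Illustrative only: build the engine with explicit batch/token limits, then set the
# instance count on the BLS model. Adjust paths and values to your workload.
trtllm-build --checkpoint_dir /models/llama/ckpt \
             --output_dir /models/llama/engine \
             --max_batch_size 64 \
             --max_num_tokens 8192

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    "bls_instance_count:5"
```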