# TensorRT LLM (TRT-LLM) Deployment
Configure TRT-LLM as the deployment backend for serving models during evaluation.
## Configuration Parameters

### Basic Settings
```yaml
deployment:
  type: trtllm                                      # select the TRT-LLM deployment backend
  image: nvcr.io/nvidia/tensorrt-llm/release:1.0.0  # TRT-LLM container image
  checkpoint_path: /path/to/model                   # model checkpoint to serve
  served_model_name: your-model-name                # model name clients use in requests
  port: 8000                                        # port the server listens on
```
### Parallelism Configuration
```yaml
deployment:
  tensor_parallel_size: 4
  pipeline_parallel_size: 1
```
- `tensor_parallel_size`: Number of GPUs to split the model across (default: 4)
- `pipeline_parallel_size`: Number of pipeline stages (default: 1)
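The product of these two values is the total number of GPUs the served model occupies, so it should match the GPUs allocated to the deployment job. As a minimal sketch, assuming a single node with 8 GPUs (the GPU count and values here are illustrative):

```yaml
deployment:
  tensor_parallel_size: 8    # shard each layer's weights across all 8 GPUs
  pipeline_parallel_size: 1  # keep a single pipeline stage
```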
### Extra Arguments and Endpoints
```yaml
deployment:
  extra_args: "--ep_size 2"
  endpoints:
    chat: /v1/chat/completions
    completions: /v1/completions
    health: /health
```
The `extra_args` field passes additional command-line arguments to the `trtllm-serve serve` command.
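For example, a sketch that combines expert parallelism with serving limits. The `--max_batch_size` and `--max_num_tokens` flags are illustrative assumptions; the flags that `trtllm-serve` accepts depend on the TensorRT-LLM version in your image, so confirm them against its help output.

```yaml
deployment:
  # --ep_size comes from the example above; the other flags are illustrative
  extra_args: "--ep_size 2 --max_batch_size 64 --max_num_tokens 8192"
```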
## Complete Example
```yaml
defaults:
  - execution: slurm/default
  - deployment: trtllm
  - _self_

deployment:
  checkpoint_path: /path/to/checkpoint
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 1
  extra_args: ""

execution:
  account: your-account
  output_dir: /path/to/output
  walltime: 02:00:00

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN_FOR_GPQA_DIAMOND  # Request access to GPQA-Diamond: https://huggingface.co/datasets/Idavidrein/gpqa
```
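To serve a larger checkpoint, you would typically only change the `deployment` block. A sketch, with an illustrative model name and GPU count:

```yaml
deployment:
  checkpoint_path: /path/to/checkpoint
  served_model_name: llama-3.1-70b-instruct  # illustrative larger model
  tensor_parallel_size: 4                    # shard across 4 GPUs
  extra_args: ""
```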
Use `nemo-evaluator-launcher run --dry-run` to check your configuration before running.