Deploy NeMo Models by Exporting TensorRT-LLM#
This section shows how to use scripts and APIs to export a NeMo LLM or a quantized model to TensorRT-LLM and deploy it with the NVIDIA Triton Inference Server.
Quick Example#
Follow the steps in the Deploy NeMo LLM main page to download the nemotron-3-8b-base-4k model.
In a terminal, go to the folder where the Nemotron-3-8B-Base-4k.nemo file is downloaded. Pull down and run the Docker container image using the commands shown below. Change the :vr tag to the version of the container you want to use:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
  -v ${PWD}/Nemotron-3-8B-Base-4k.nemo:/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo \
  -w /opt/NeMo \
  nvcr.io/nvidia/nemo:vr
Run the following deployment script to verify that everything is working correctly. The script exports the Nemotron NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:
python scripts/deploy/nlp/deploy_triton.py \
  --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo \
  --model_type gptnext \
  --triton_model_name nemotron \
  --tensor_parallelism_size 1
If the test yields a shared memory-related error, increase the shared memory size using the --shm-size Docker option.

In a separate terminal, run the following command to get the ID of the running container. Look for the container that uses the nvcr.io/nvidia/nemo:vr image:

docker ps

Access the running container, replacing container_id with the actual container ID:

docker exec -it container_id bash

To send a query to the Triton server, run the following script:

python scripts/deploy/nlp/query.py -mn nemotron -p "What is the color of a banana?" -mol 5
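You can also send queries from Python instead of using query.py. The following is a minimal sketch; it assumes the NemoQueryLLM client from the NeMo deploy module is available in your environment and that the parameter names match your NeMo version:

from nemo.deploy.nlp import NemoQueryLLM

# Connect to the Triton server started by deploy_triton.py.
nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")

# Send the same prompt as above; the keyword for the output length
# (max_output_len here) may differ slightly across NeMo versions.
output = nq.query_llm(prompts=["What is the color of a banana?"], max_output_len=5)
print(output)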
To export and deploy a different model, such as Llama3, Mixtral, or Starcoder, change the model_type argument in the deploy_triton.py command. See the table below for the list of supported model types.
Use a Script to Deploy NeMo LLMs on a Triton Server#
You can deploy an LLM from a NeMo checkpoint on Triton using the provided script.
Export and Deploy an LLM Model#
The script first exports the model to TensorRT-LLM and then starts the service on Triton.
Start the container using the steps described in the Quick Example section.
To begin serving the downloaded model, run the following script:
python scripts/deploy/nlp/deploy_triton.py \
  --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo \
  --model_type gptnext \
  --triton_model_name nemotron \
  --tensor_parallelism_size 1
The following parameters are defined in the deploy_triton.py script:

--nemo_checkpoint: path of the .nemo or .qnemo checkpoint file.
--model_type: type of the model. Choices: "gptnext", "gpt", "llama", "falcon", "starcoder", "mixtral", "gemma".
--triton_model_name: name of the model on Triton.
--triton_model_version: version of the model. Default is 1.
--triton_port: port for the Triton server to listen for requests. Default is 8000.
--triton_http_address: HTTP address for the Triton server. Default is 0.0.0.0.
--triton_model_repository: TensorRT temp folder. Default is /tmp/trt_llm_model_dir/.
--num_gpus: number of GPUs to use for inference. Large models require multi-GPU export. This parameter is deprecated and will be removed after the next release.
--tensor_parallelism_size: number of GPUs to split the tensors for tensor parallelism. Default is 1.
--pipeline_parallelism_size: number of GPUs to split the model for pipeline parallelism. Default is 1.
--dtype: data type of the model on TensorRT-LLM. Default is "bfloat16". Currently, only "bfloat16" is supported.
--max_input_len: maximum input length of the model. Default is 256.
--max_output_len: maximum output length of the model. Default is 256.
--max_batch_size: maximum batch size of the model. Default is 8.
--max_num_tokens: maximum number of tokens. Default is None.
--opt_num_tokens: optimum number of tokens. Default is None.
--ptuning_nemo_checkpoint: source .nemo file for the prompt embeddings table.
--task_ids: unique task names for the prompt embedding.
--max_prompt_embedding_table_size: maximum prompt embedding table size.
--lora_ckpt: a list of LoRA weight checkpoints.
--use_lora_plugin: activates the LoRA plugin, which enables embedding sharing.
--lora_target_modules: specifies the modules to which LoRA is added. Only effective when use_lora_plugin is enabled.
--max_lora_rank: maximum LoRA rank for the different LoRA modules. It is used to compute the workspace size of the LoRA plugin.
--no_paged_kv_cache: disables the paged KV cache in TensorRT-LLM.
--disable_remove_input_padding: disables the remove-input-padding option of TensorRT-LLM.
--use_parallel_embedding: enables the parallel embedding feature of TensorRT-LLM.
--export_fp8_quantized: manually overrides the FP8 quantization settings.
--use_fp8_kv_cache: manually overrides the FP8 KV-cache quantization settings.

A subset of these options maps onto the Python export API, as shown in the sketch after this list.
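The following sketch is illustrative only; the keyword argument names mirror the script flags above and are assumptions, so check the TensorRTLLM.export docstring for the exact signature in your NeMo version:

from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Assumed keyword arguments that mirror the script flags above; verify the
# exact names against the TensorRTLLM.export docstring in your installation.
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo",
    model_type="gptnext",
    tensor_parallelism_size=1,
    max_input_len=256,
    max_output_len=256,
    max_batch_size=8,
)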
Note
The parameters described here are generalized and should be compatible with any NeMo checkpoint. It is important, however, that you check the LLM model table in the main Deploy NeMo LLM page for optimized inference model compatibility. We are actively working on extending support to other checkpoints.
To export and deploy a different model, such as Llama3, Mixtral, or Starcoder, change the model_type argument in the deploy_triton.py script. See the table below to learn which model_type to use for each LLM.
Model Name      model_type
GPT             gpt
Nemotron        gpt
Llama 2         llama
Llama 3         llama
Llama 3.1       llama
Gemma           gemma
StarCoder1      starcoder
StarCoder2      starcoder
Mistral         llama
Mixtral         mixtral
Each time the script is executed, it exports the NeMo checkpoint to TensorRT-LLM before starting the service. If you want to skip the export step on subsequent runs of the optimized inference option, you can persist the exported model by specifying an empty directory as the Triton model repository. Stop the running container, and then run the following commands to create that directory, mount it into the container, and point the script at it:
mkdir tmp_triton_model_repository

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
  -v ${PWD}:/opt/checkpoints/ \
  -w /opt/NeMo \
  nvcr.io/nvidia/nemo:vr

python scripts/deploy/nlp/deploy_triton.py \
  --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo \
  --model_type gptnext \
  --triton_model_name nemotron \
  --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \
  --tensor_parallelism_size 1
The checkpoint will be exported to the specified folder after executing the script mentioned above.
To load the exported model directly, run the following script within the container:
python scripts/deploy/nlp/deploy_triton.py \
  --triton_model_name nemotron \
  --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \
  --model_type gptnext
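The same skip-export flow is available through the Python APIs. The sketch below assumes the engine has already been exported to /opt/checkpoints/tmp_triton_model_repository and that the TensorRTLLM constructor loads an existing engine when one is found in model_dir:

from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Point at a folder that already contains an exported TensorRT-LLM engine;
# no export() call is made here (assumption: the constructor picks up the
# existing engine from model_dir).
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
nm.deploy()
nm.serve()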
Access the models with a Hugging Face token.
If you want to run inference using the StarCoder1, StarCoder2, or Llama3 models, you'll need to generate a Hugging Face token that has access to these models. Visit Hugging Face for more information. After you have the token, perform one of the following steps.
Log in to Hugging Face:
huggingface-cli login
Or, set the HF_TOKEN environment variable:
export HF_TOKEN=your_token_here
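If you prefer to set up access from Python, for example inside a notebook running in the container, the huggingface_hub package provides an equivalent login call. The token value below is a placeholder:

from huggingface_hub import login

# Placeholder token; replace it with the token generated from your Hugging Face account.
login(token="your_token_here")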
Use Prompt Embedding Tables#
You can use learned virtual tokens to perform a downstream task during inference. Once the virtual tokens are learned using the NeMo Framework training container, all the tokens are saved in a .nemo file. You can feed this file into the script as shown in the following command. Since no NeMo checkpoint with learned virtual tokens is currently available on NVIDIA NGC or Hugging Face, you'll need to find or generate one yourself.
Assuming there is a checkpoint for the prompt embedding table, run the following command:
python scripts/deploy/nlp/deploy_triton.py \
  --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo \
  --model_type gptnext \
  --triton_model_name nemotron \
  --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \
  --max_prompt_embedding_table_size 1024 \
  --ptuning_nemo_checkpoint /opt/checkpoints/my_ptuning_table.nemo \
  --task_ids "task 1" \
  --tensor_parallelism_size 1
The max_prompt_embedding_table_size parameter should be set to the total number of virtual tokens across all of the downstream tasks.

To pass multiple NeMo checkpoints, run the following command:
python scripts/deploy/nlp/deploy_triton.py \
  --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo \
  --model_type gptnext \
  --triton_model_name nemotron \
  --triton_model_repository /opt/checkpoints/tmp_triton_model_repository \
  --max_prompt_embedding_table_size 1024 \
  --ptuning_nemo_checkpoint /opt/checkpoints/my_ptuning_table-1.nemo /opt/checkpoints/my_ptuning_table-2.nemo \
  --task_ids "task 1" "task 2" \
  --tensor_parallelism_size 1
Please ensure that the combined total number of virtual tokens in my_ptuning_table-1.nemo and my_ptuning_table-2.nemo doesn’t exceed the max_prompt_embedding_table_size parameter.
Use NeMo Export and Deploy Module APIs to Run Inference#
Up until now, we’ve used scripts for exporting and deploying LLM models. However, NeMo’s Deploy and Export modules offer straightforward APIs for deploying models to Triton and exporting NeMo checkpoints to TensorRT-LLM.
Export an LLM Model to TensorRT-LLM#
You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path. Additionally, the /opt/checkpoints/tmp_trt_llm path is assumed to exist.
Run the following command:
from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo",
    model_type="gptnext",
    n_gpus=1,
)
trt_llm_exporter.forward(
    ["What is the best city in the world?"],
    max_output_token=15,
    top_k=1,
    top_p=0.0,
    temperature=1.0,
)
Be sure to check the TensorRTLLM class docstrings for details.
Deploy an LLM Model to TensorRT-LLM#
You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path. Additionally, the /opt/checkpoints/tmp_trt_llm path is assumed to exist.
Run the following command:
from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo",
    model_type="gptnext",
    n_gpus=1,
)

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
nm.deploy()
nm.serve()
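Note that nm.serve() blocks the calling process. For an interactive session, such as a notebook, a non-blocking variant along the lines of the sketch below can be used; it assumes the DeployPyTriton.run() and stop() methods and the NemoQueryLLM client behave as in recent NeMo releases:

from nemo.export.tensorrt_llm import TensorRTLLM
from nemo.deploy import DeployPyTriton
from nemo.deploy.nlp import NemoQueryLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo",
    model_type="gptnext",
    n_gpus=1,
)

nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
nm.deploy()
nm.run()  # assumed non-blocking start, in contrast to nm.serve()

# Query the model from the same process, then shut the server down.
nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
print(nq.query_llm(prompts=["What is the best city in the world?"], max_output_len=15))

nm.stop()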
Direct TensorRT-LLM Export for FP8-trained Models#
If you have an FP8-trained checkpoint, produced during pre-training or fine-tuning with NVIDIA Transformer Engine, you can convert it to an FP8 TensorRT-LLM engine directly using nemo.export. The entry point is the same as with regular .nemo and .qnemo checkpoints:
from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama2-7b-base-fp8.nemo",
    model_type="llama",
)
trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
The export settings for quantization can be adjusted via the trt_llm_exporter.export arguments:

fp8_quantized: Optional[bool] = None: enables or disables FP8 quantization.
fp8_kvcache: Optional[bool] = None: enables or disables FP8 quantization for the KV-cache.
By default, the quantization settings are auto-detected from the NeMo checkpoint.
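To override the auto-detection, pass the two arguments explicitly. This is a minimal sketch based on the argument names listed above:

from nemo.export.tensorrt_llm import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Force FP8 weights and an FP8 KV-cache instead of relying on auto-detection.
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/llama2-7b-base-fp8.nemo",
    model_type="llama",
    fp8_quantized=True,
    fp8_kvcache=True,
)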