Exporting NeMo Models to TensorRT-LLM

The NeMo Framework container includes modules and scripts that help you export NeMo LLMs to TensorRT-LLM through easy-to-use APIs. This section shows how to export a NeMo checkpoint to TensorRT-LLM.

Supported Models and GPUs

  • Supported GPUs:
    • A100

    • H100

  • Supported models, along with their parameter counts in the distributed NeMo checkpoint format, are listed in the table below.

Model Name    Model Parameters    NeMo Precision    TensorRT-LLM Precision    Fine Tuning
GPT           2B, 8B, 43B         bfloat16          bfloat16                  SFT, RLHF, SteerLM
LLAMA2        7B, 13B, 70B        bfloat16          bfloat16                  SFT, RLHF, SteerLM

The table above lists the supported NeMo and TensorRT-LLM precisions, as well as the supported fine-tuned variants, for each model. Please note that only NeMo models in the distributed checkpoint format are supported.

User Guide

First, run the following command to download a NeMo checkpoint:

wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo

The GPT-2B-001_bf16_tp1.nemo checkpoint contains a trained GPT model and is used as the example throughout this document. Pull and run the container as shown below, replacing vr with the version of the container you would like to use:

docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr

An LLM stored in a NeMo checkpoint can be exported to TensorRT-LLM using the following commands:

mkdir tmp_model_repository

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr

python scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/GPT-2B-001_bf16_tp1.nemo --model_type="gptnext" --model_repository /opt/checkpoints/tmp_model_repository/

Parameters of the export_to_trt.py script are listed below, followed by an example invocation that combines several of them:

  • nemo_checkpoint - path of the NeMo checkpoint file.

  • model_type - type of the model. choices=["gptnext", "llama"].

  • model_repository - path of the folder where the generated TensorRT-LLM model files are stored. Default is /tmp/trt_llm_model_dir/.

  • num_gpus - number of GPUs to use for inference. Large models require multi-GPU export.

  • dtype - data type of the model on TensorRT-LLM. Default is "bf16". Currently, only "bf16" is supported.

  • max_input_len - maximum input length of the model.

  • max_output_len - maximum output length of the model.

  • max_batch_size - maximum batch size of the model.
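
As an illustration, the command below combines several of these options. The flag names come from the parameter list above; the numeric values are placeholder assumptions for this sketch, not recommended settings:

python scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/GPT-2B-001_bf16_tp1.nemo --model_type="gptnext" --model_repository /opt/checkpoints/tmp_model_repository/ --num_gpus 1 --dtype "bf16" --max_input_len 2048 --max_output_len 512 --max_batch_size 8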

NeMo Export Module and APIs

So far, we have only used scripts to export LLM models. The NeMo Export module also provides easy-to-use Python APIs for exporting NeMo checkpoints to TensorRT-LLM.

You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. See the following code example, which assumes the GPT-2B-001_bf16_tp1.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path:

from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo", model_type="gptnext", n_gpus=1)
output = trt_llm_exporter.forward(["What is the best city in the world?"], max_output_token=17, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)

Please check the TensorRTLLM docstrings for details.

PyTest for Export Module

Please check the pytest file in the container, listed below, for more examples of how to use the NeMo APIs for export operations.

  • Export test: /opt/NeMo/tests/export/test_nemo_export.py
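
For example, assuming pytest is available inside the container (the required checkpoints and test options may differ between container versions), the export tests can be launched with:

pytest /opt/NeMo/tests/export/test_nemo_export.py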