The NeMo Framework Inference Container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM with easy-to-use APIs. In this section, we'll show you how to export a NeMo checkpoint to TensorRT-LLM.
- Supported GPUs:
  - A100
  - H100
The supported models, along with their available parameter counts in the distributed NeMo checkpoint format, are listed below.
| Model Name | Model Parameters | NeMo Precision | TensorRT-LLM Precision | Fine Tuning |
|---|---|---|---|---|
| GPT | 2B, 8B, 43B | bfloat16 | bfloat16 | SFT, RLHF, SteerLM |
| LLAMA2 | 7B, 13B, 70B | bfloat16 | bfloat16 | SFT, RLHF, SteerLM |
The table above lists the supported NeMo and TensorRT-LLM model precisions as well as the supported fine-tuned variants. Please note that only NeMo models in the distributed checkpoint format are supported.
First, run the following command to download a NeMo checkpoint:
wget https://huggingface.co/nvidia/GPT-2B-001/resolve/main/GPT-2B-001_bf16_tp1.nemo
The GPT-2B-001_bf16_tp1.nemo checkpoint contains a trained GPT model and is used as the example throughout this document. Pull and run the container as shown below. Please change vr to the version of the container you would like to use:
docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
An LLM stored in a NeMo checkpoint can be exported to TensorRT-LLM using the following script:
mkdir tmp_model_repository
docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
python scripts/export/export_to_trt.py --nemo_checkpoint /opt/checkpoints/GPT-2B-001_bf16_tp1.nemo --model_type="gptnext" --model_repository /opt/checkpoints/tmp_model_repository/
Parameters of the export_to_trt.py script (see the example invocation after this list):

- nemo_checkpoint - path of the NeMo checkpoint.
- model_type - type of the model. choices=["gptnext", "llama"].
- model_repository - TensorRT-LLM temp folder. Default is /tmp/trt_llm_model_dir/.
- num_gpus - number of GPUs to use for inference. Large models require multi-GPU export.
- dtype - data type of the model on TensorRT-LLM. Default is "bf16". Currently only "bf16" is supported.
- max_input_len - maximum input length of the model.
- max_output_len - maximum output length of the model.
- max_batch_size - maximum batch size of the model.
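For example, a single-GPU export that sets the optional length and batch-size limits explicitly might look like the sketch below. The flag spellings follow the parameter names above, and the specific values (such as max_input_len and max_batch_size) are illustrative assumptions only; adjust them for your model and hardware:

python scripts/export/export_to_trt.py \
    --nemo_checkpoint /opt/checkpoints/GPT-2B-001_bf16_tp1.nemo \
    --model_type="gptnext" \
    --model_repository /opt/checkpoints/tmp_model_repository/ \
    --num_gpus 1 \
    --dtype "bf16" \
    --max_input_len 2048 \
    --max_output_len 300 \
    --max_batch_size 8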
So far, we have only used scripts to export LLM models. The NeMo Export module also provides easy-to-use APIs for exporting NeMo checkpoints to TensorRT-LLM.
You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. Please see the following code example, which assumes the GPT-2B-001_bf16_tp1.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path:
from nemo.export import TensorRTLLM

# Create an exporter that writes the TensorRT-LLM engine to the given model directory.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")
# Build the TensorRT-LLM engine from the NeMo checkpoint.
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo", model_type="gptnext", n_gpus=1)
# Run inference on the exported model.
output = trt_llm_exporter.forward(["What is the best city in the world?"], max_output_token=17, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)
Please check the TensorRTLLM docstrings for details.
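Once the engine has been built, you do not necessarily need to repeat the export step in every session. The sketch below assumes that constructing TensorRTLLM with a model_dir that already contains a previously exported engine loads that engine, so forward can be called directly; if your container version behaves differently, run export first as shown above. It also illustrates that forward takes a list of prompts, so several prompts can be sent in a single call (the prompts and max_output_token value here are only examples):

from nemo.export import TensorRTLLM

# Assumption: pointing model_dir at a directory that already contains an exported
# engine loads that engine, so no new export() call is needed here.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")

# forward() accepts a list of prompts, so multiple prompts can be sent in one call.
prompts = ["What is the best city in the world?", "Write a short poem about GPUs."]
output = trt_llm_exporter.forward(prompts, max_output_token=32, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)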
PyTest for Export Module
Please check the pytest listed below in the container for more examples of how to use the NeMo APIs for the export operation.
Export test:
/opt/NeMo/tests/export/test_nemo_export.py
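To try the test inside the container, you can invoke pytest on the file directly, as sketched below. Depending on how the tests are parameterized, they may expect specific checkpoints to be available under /opt/checkpoints/ or require additional command-line options, so check the test file before running:

python -m pytest /opt/NeMo/tests/export/test_nemo_export.py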