The NeMo inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server with easy-to-use APIs. This section shows how to export NeMo checkpoints to TensorRT-LLM.
Note the following requirements and limitations:

- An Ampere GPU is required (Volta GPUs are not currently supported).
- Only NeMo checkpoints for the GPT model with bf16 precision are currently supported.
- Only the base GPT and LLaMA v2 models have been tested.
Pull and run the container as shown below. Replace vr with the container version you would like to use:
docker pull nvcr.io/ea-bignlp/beta-inf-prerelease/infer:23.08.vr
docker run --gpus all -it --rm --shm-size=30g -w /opt/NeMo nvcr.io/ea-bignlp/beta-inf-prerelease/infer:23.08.vr
Then run the following pytest to verify that everything is working:
py.test -s tests/export/test_nemo_export.py
This test downloads the GPT-2B-001_bf16_tp1.nemo checkpoint hosted on Hugging Face, exports it to TensorRT-LLM, and then runs inference to check that the service is working.
If the test fails with a shared-memory related error, increase the shared memory size with the --shm-size option of docker run.
You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes that the GPT-2B-001_bf16_tp1.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path, and that the /opt/checkpoints/tmp_trt_llm/ path exists:
from nemo.export import TensorRTLLM

# Create an exporter that writes the TensorRT-LLM engine files into model_dir.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
# Convert the NeMo checkpoint to a TensorRT-LLM engine ("gptnext" is the model type used for NeMo GPT checkpoints).
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo", model_type="gptnext", n_gpus=1)
# Run inference on the exported engine; top_k=1 with temperature=0.0 gives greedy decoding.
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
print("output: ", output)
Please check the TensorRTLLM docstrings for details.
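A quick way to inspect the available parameters from inside the container is Python's built-in help(), which prints those docstrings; the snippet assumes only the import used above:

from nemo.export import TensorRTLLM

# Print the class docstring along with the export() and forward() signatures.
help(TensorRTLLM)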