The NeMo Framework inference container includes modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.
Pull and run the NeMo Framework inference container. Please change vr below to the version of the container you would like to use:
docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
Set --shm-size to the shared memory available on your system. Replace <local_path_to_checkpoint> with the path to your GPT checkpoint; it will be mounted at /opt/checkpoints inside the container.
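Before choosing a value for --shm-size, it can help to check how much space is free at the shared-memory mount. A minimal sketch using only the Python standard library; the path /dev/shm is the usual Linux shared-memory mount and is an assumption here, not something this container requires:

```python
import shutil

def free_gib(path: str) -> float:
    """Return the free space at `path` in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)

# On most Linux hosts, shared memory is mounted at /dev/shm:
# print(f"--shm-size candidate: {free_gib('/dev/shm'):.0f}g")
```

A value somewhat below the free space reported for /dev/shm is a reasonable starting point for --shm-size.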
Create a folder at /opt/checkpoints/tmp_trt_llm.
Use the TensorRT-LLM APIs to export the NeMo checkpoint:
from nemo.export import TensorRTLLM

# Point the exporter at the engine directory created above
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Convert the .nemo checkpoint into a TensorRT-LLM engine
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo", model_type="gptnext", n_gpus=1)

# Run greedy inference (top_k=1) on two example prompts
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
print("output: ", output)
Please check the TensorRTLLM class docstrings for details.