The NeMo Framework inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.
Pull and run the NeMo Framework inference container:
docker pull nvcr.io/ea-bignlp/beta-inf-prerelease/infer:23.08.vr
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo nvcr.io/ea-bignlp/beta-inf-prerelease/infer:23.08.vr
Set --shm-size to the available shared memory.
Set the path to the Llama checkpoint (<local_path_to_checkpoint>) so that it is mounted to the container path /opt/checkpoints.
Create a folder at /opt/checkpoints/tmp_trt_llm.
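If the folder does not already exist, it can also be created from the same Python session used for the export below; this is a minimal sketch, and the path simply matches the model_dir used in the next step:

import os

# Create the working directory that the exporter writes the TensorRT-LLM engine files into.
os.makedirs("/opt/checkpoints/tmp_trt_llm", exist_ok=True)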
Use the TensorRT-LLM APIs to export the NeMo checkpoint:
from nemo.export import TensorRTLLM

# Write the TensorRT-LLM engine files to the folder created above.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/megatron_llama.nemo", model_type="llama", n_gpus=1)

# Run inference on the exported engine (top_k=1 gives greedy decoding).
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
print("output: ", output)
Please check the TensorRTLLM docstrings for details.
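To serve the exported model with the Triton Inference Server mentioned above, the exporter object can be handed to the deployment module. The following is a minimal sketch, assuming the container's nemo.deploy module exposes DeployPyTriton; the model name "llama" and port 8000 are illustrative choices, not fixed values:

from nemo.deploy import DeployPyTriton

# Wrap the exporter in a Triton deployment; triton_model_name and port are illustrative.
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="llama", port=8000)
nm.deploy()   # load the model into Triton
nm.serve()    # block and serve inference requests

Once serving, the model can be queried over Triton's HTTP/gRPC endpoints; see the DeployPyTriton docstrings for the supported options.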