Model Export to TensorRT-LLM

The NeMo Framework inference container includes modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to Triton Inference Server.

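Pull and run the NeMo Framework inference container, substituting the container image name and tag for <container_image>:


docker pull <container_image>
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo <container_image>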

Set --shm-size to the amount of shared memory available on the host.

Set <local_path_to_checkpoint> to the local path of the Llama checkpoint. It is mounted at /opt/checkpoints inside the container.
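
The flags described above can also be assembled programmatically, which makes the mount and shared-memory settings explicit. The following is a minimal sketch and not part of the NeMo tooling: the helper name build_docker_run_args is hypothetical, the host path /data/llama is only an example, and the image argument is a placeholder for the container image name.

```python
import shlex


def build_docker_run_args(local_checkpoint_path, image, shm_size="30g"):
    """Return the docker run argument list for the inference container.

    `image` is a placeholder for the container image name; this helper
    is illustrative and is not provided by NeMo.
    """
    return [
        "docker", "run", "--gpus", "all", "-it", "--rm",
        # Shared memory made available to the container.
        f"--shm-size={shm_size}",
        # Mount the host checkpoint directory at /opt/checkpoints.
        "-v", f"{local_checkpoint_path}:/opt/checkpoints",
        # Start in the NeMo source directory.
        "-w", "/opt/NeMo",
        image,
    ]


args = build_docker_run_args("/data/llama", "<container_image>")
# shlex.join quotes each argument so the result can be pasted into a shell.
print(shlex.join(args))
```

Passing a different shm_size value, such as "16g", changes only the --shm-size flag.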

Create the folder /opt/checkpoints/tmp_trt_llm_folder to hold the exported model. This is the path passed as model_dir in the export step.
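
The folder can also be created from Python before the export runs. This is a minimal sketch using only the standard library; the helper name ensure_engine_dir is hypothetical, and the demonstration uses a temporary directory rather than the container path.

```python
import tempfile
from pathlib import Path


def ensure_engine_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    engine_dir = Path(path)
    engine_dir.mkdir(parents=True, exist_ok=True)
    return engine_dir


# Demonstrated against a temporary directory; inside the container the
# argument would be the folder path named in the step above.
with tempfile.TemporaryDirectory() as root:
    created = ensure_engine_dir(Path(root) / "engine_dir")
    was_created = created.is_dir()

print(was_created)
```

Because exist_ok=True is set, calling the helper again on an existing folder does not raise an error.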

Use the TensorRT-LLM APIs to export the NeMo checkpoint:


from nemo.export import TensorRTLLM

# Directory where the exported TensorRT-LLM model files are stored.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm_folder/")
# Export the NeMo checkpoint as a Llama model on one GPU.
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/megatron_llama.nemo", model_type="llama", n_gpus=1)
# Run a test generation on two prompts.
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
print("output: ", output)

Please check the TensorRTLLM docstrings for details.

© Copyright 2023, NVIDIA. Last updated on Nov 14, 2023.