Model Export to TensorRT-LLM

The NeMo Framework inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.

Pull and run the NeMo Framework inference container. Please change vr below to the version of the container you would like to use:


docker pull nvcr.io/nvidia/nemo:vr
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo nvcr.io/nvidia/nemo:vr

Set --shm-size according to the shared memory available on your system.

Set the path to the Falcon checkpoint (<local_path_to_checkpoint>) so that it is mounted at /opt/checkpoints inside the container.

Create a folder at /opt/checkpoints/tmp_trt_llm to hold the exported TensorRT-LLM engine files.
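For example, from a shell inside the container (the path below matches the mount point used in the docker run command above):

```shell
# Create the directory that the TensorRT-LLM engine files will be written to;
# -p creates parent directories as needed and succeeds if the folder exists
mkdir -p /opt/checkpoints/tmp_trt_llm
```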

Use the TensorRT-LLM APIs to export the NeMo checkpoint:


from nemo.export import TensorRTLLM

# Point the exporter at the folder created above; the TensorRT-LLM engine
# files will be written there.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Export the NeMo checkpoint to a TensorRT-LLM engine.
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_falcon.nemo",
    model_type="falcon",
    n_gpus=1,
)

# Run inference on the exported engine.
output = trt_llm_exporter.forward(
    ["test1", "how about test 2"],
    max_output_len=150,
    top_k=1,
    top_p=0.5,
    temperature=0.5,
)
print("output: ", output)

Falcon 7B/40B/180B requires 1/2/8 GPUs, respectively; set n_gpus accordingly. Please check the TensorRTLLM docstrings for details.

© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.