Model Export to TensorRT-LLM

The NeMo Framework container includes modules and scripts that help you export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.

Pull and run the dedicated NeMo Framework container for Gemma:

docker pull nvcr.io/nvidia/nemo:24.01.gemma

docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo nvcr.io/nvidia/nemo:24.01.gemma

Set --shm-size according to the shared memory available on your system.

Set <local_path_to_checkpoint> to the local path of the Gemma checkpoint directory; it is mounted into the container at /opt/checkpoints. The directory should contain the .nemo checkpoint produced by the Checkpoint Conversion step.

Create a folder at /opt/checkpoints/tmp_trt_llm to hold the exported TensorRT-LLM engine files.
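If the export is driven from a Python script, the folder can also be created there before constructing the exporter. This is a minimal sketch, assuming the container paths described above:

import os

# Create the directory that will hold the exported TensorRT-LLM engine files.
os.makedirs("/opt/checkpoints/tmp_trt_llm", exist_ok=True)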

Use the TensorRT-LLM APIs to export the NeMo checkpoint:

from nemo.export import TensorRTLLM

# Point the exporter at the folder created above; the TensorRT-LLM engine is written here.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Export the converted .nemo checkpoint to a TensorRT-LLM engine.
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/megatron_gemma.nemo", model_type="gemma", n_gpus=1)

# Run a quick inference check on the exported engine.
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
print("output: ", output)

See the TensorRTLLM docstrings for further details on the available parameters.
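As an additional illustration, forward accepts a batch of prompts, and the sampling parameters shown above can be varied per call. The sketch below uses placeholder prompts, and the top_k, top_p, and temperature values are chosen only for demonstration:

# Assumes trt_llm_exporter was created and export() was run as shown above.
prompts = [
    "Write a short poem about GPUs.",      # placeholder prompt
    "Summarize what TensorRT-LLM does.",   # placeholder prompt
]

# Greedy decoding (top_k=1) versus sampled decoding with a higher temperature.
greedy_output = trt_llm_exporter.forward(prompts, max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
sampled_output = trt_llm_exporter.forward(prompts, max_output_len=150, top_k=50, top_p=0.9, temperature=0.7)

print("greedy: ", greedy_output)
print("sampled: ", sampled_output)

Greedy decoding makes the output deterministic, which is convenient for a quick sanity check of the exported engine; sampled decoding gives more varied generations.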
