The NeMo Framework container includes modules and scripts for exporting NeMo LLM models to TensorRT-LLM and deploying them to Triton Inference Server.
Pull and run the NeMo Framework dedicated container for Gemma:
docker pull nvcr.io/nvidia/nemo:24.01.gemma
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo nvcr.io/nvidia/nemo:24.01.gemma
Set --shm-size to the amount of shared memory available.
Set <local_path_to_checkpoint> to the local directory that holds the Gemma checkpoint; it is mounted at the container path /opt/checkpoints.
This directory should contain the .nemo checkpoint produced by Checkpoint Conversion.
Create a folder at the path /opt/checkpoints/tmp_trt_llm.
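The folder setup above can be sketched in Python. This is a minimal illustration, not part of NeMo: the helper names prepare_export_dir and find_nemo_checkpoints are hypothetical, and the sketch assumes the .nemo file sits directly under the mounted checkpoint path.

```python
from pathlib import Path


def prepare_export_dir(checkpoint_root, subdir="tmp_trt_llm"):
    """Create the TensorRT-LLM output folder under the mounted checkpoint path.

    Hypothetical helper: `subdir` defaults to the folder name used in this guide.
    """
    export_dir = Path(checkpoint_root) / subdir
    # exist_ok=True makes the call safe to repeat across container runs.
    export_dir.mkdir(parents=True, exist_ok=True)
    return export_dir


def find_nemo_checkpoints(checkpoint_root):
    """Return the .nemo files located directly under the checkpoint path.

    Hypothetical helper: assumes checkpoints are not nested in subfolders.
    """
    return sorted(Path(checkpoint_root).glob("*.nemo"))
```

Inside the container, calling prepare_export_dir("/opt/checkpoints") creates /opt/checkpoints/tmp_trt_llm, and an empty result from find_nemo_checkpoints("/opt/checkpoints") suggests that the volume mount or the checkpoint conversion needs to be revisited before exporting.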
Use the TensorRT-LLM APIs to export the NeMo checkpoint:
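from nemo.export import TensorRTLLM

# model_dir is the folder created above; the exported TensorRT-LLM files are written there.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
# Export the .nemo checkpoint to TensorRT-LLM for a single GPU.
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/megatron_gemma.nemo", model_type="gemma", n_gpus=1)
# Run a quick generation test on two prompts; top_k=1 selects the most likely token at each step.
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.0, temperature=0.0)
print("output: ", output)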
See the TensorRTLLM docstrings for details on the available parameters.
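One way to read those docstrings from a Python session is through the standard inspect module. The sketch below is only an illustration: describe_methods is a hypothetical helper, and the import is guarded because nemo.export is assumed to be available only inside the NeMo container.

```python
import inspect


def describe_methods(cls, method_names):
    """Return {name: "name(signature)" + newline + docstring} for methods of cls.

    Hypothetical helper built only on the standard library.
    """
    described = {}
    for name in method_names:
        method = getattr(cls, name)
        signature = inspect.signature(method)
        doc = inspect.getdoc(method) or "(no docstring)"
        described[name] = f"{name}{signature}\n{doc}"
    return described


try:
    from nemo.export import TensorRTLLM
except ImportError:
    # Assumption: outside the NeMo container the package is not installed.
    TensorRTLLM = None

if TensorRTLLM is not None:
    # Print the signature and docstring of the two methods used in this guide.
    for text in describe_methods(TensorRTLLM, ["export", "forward"]).values():
        print(text)
        print()
```

The built-in help(TensorRTLLM) gives similar information interactively; the helper is mainly convenient when only a few methods are of interest.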