Model Export to TensorRT-LLM

The NeMo Framework inference container includes modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.

Pull and run the NeMo Framework inference container. Replace vr below with the version of the container you would like to use:

docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
docker run --gpus all -it --rm --shm-size=30g \
    -v <local_path_to_checkpoint>:/opt/checkpoints \
    -w /opt/NeMo \
    nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr

Set --shm-size to the amount of shared memory available on the host.

Set <local_path_to_checkpoint> to the local path of the GPT checkpoint; it is mounted inside the container at /opt/checkpoints.

Create a folder at /opt/checkpoints/tmp_trt_llm, for example as in the sketch below.
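
The folder can be created from the shell or, as in this minimal sketch, from the same Python session that will run the export (the path is the one used as model_dir below):

import os

# Create the engine output directory under the mounted checkpoint path.
# exist_ok=True keeps the step idempotent across repeated runs.
os.makedirs("/opt/checkpoints/tmp_trt_llm", exist_ok=True)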

Use the TensorRT-LLM APIs to export the NeMo checkpoint:

from nemo.export import TensorRTLLM

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo",
    model_type="gptnext",
    n_gpus=1,
)
output = trt_llm_exporter.forward(
    ["test1", "how about test 2"],
    max_output_len=150,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("output: ", output)

Please check the TensorRTLLM docstrings for details.
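
Once the engine has been built, later sessions can skip the export step. The following is a minimal sketch that assumes constructing TensorRTLLM with a model_dir that already contains an exported engine loads that engine; check the class docstring to confirm this behavior for your container version:

from nemo.export import TensorRTLLM

# Point at the directory that already holds the exported engine;
# the constructor is assumed to load it, so no new export call is needed.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm")

output = trt_llm_exporter.forward(
    ["test1"],
    max_output_len=50,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print("output: ", output)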
