Model Export to TensorRT-LLM

The NeMo Framework inference container includes modules and scripts that help you export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.

Pull and run the NeMo Framework inference container. Please change vr below to the version of the container you would like to use:

docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr

Set --shm-size to the shared memory available on your system.

Set <local_path_to_checkpoint> to the path of the Falcon checkpoint so that it is mounted at the container path /opt/checkpoints.

Create a folder at /opt/checkpoints/tmp_trt_llm, for example as shown in the sketch below.
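If you prefer to create the folder from Python rather than with mkdir, a minimal sketch (the path simply mirrors the model_dir used in the export example below):

import os

# Folder that the TensorRT-LLM exporter below will use as its model directory.
os.makedirs("/opt/checkpoints/tmp_trt_llm", exist_ok=True)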

Use the TensorRT-LLM APIs to export the NeMo checkpoint:

from nemo.export import TensorRTLLM

# Point the exporter at the folder created above; the generated TensorRT-LLM engine files are written there.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Export the Falcon checkpoint to a TensorRT-LLM engine on a single GPU.
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/megatron_falcon.nemo", model_type="falcon", n_gpus=1)

# Run a quick sanity check with two test prompts.
output = trt_llm_exporter.forward(["test1", "how about test 2"], max_output_len=150, top_k=1, top_p=0.5, temperature=0.5)
print("output: ", output)

Falcon 7B, 40B, and 180B need 1, 2, and 8 GPUs, respectively. Please check the TensorRTLLM docstrings for details.
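The larger Falcon variants use the same export API with a higher n_gpus value. The following is a minimal sketch for Falcon 40B on a 2-GPU node; the checkpoint file name megatron_falcon_40b.nemo is a hypothetical placeholder for your own checkpoint:

from nemo.export import TensorRTLLM

# Falcon 40B requires 2 GPUs; the exporter handles the multi-GPU engine build.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_falcon_40b.nemo",  # hypothetical checkpoint file name
    model_type="falcon",
    n_gpus=2,
)

# Quick sanity check after the export.
output = trt_llm_exporter.forward(["test prompt"], max_output_len=150, top_k=1, top_p=0.5, temperature=0.5)
print("output: ", output)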
