The NeMo inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to Triton Inference Server.

Pull and run the NeMo Framework inference container. Replace the vr below with the version of the container you would like to use:


docker pull <container_image>:vr
docker run --gpus all -it --rm --shm-size=30g -v <local_path_to_checkpoint>:/opt/checkpoints -w /opt/NeMo <container_image>:vr

Set --shm-size to the amount of shared memory available on your system.

Set <local_path_to_checkpoint> to the local directory that holds the Baichuan2 checkpoint. It is mounted at /opt/checkpoints inside the container.

Create the folder /opt/checkpoints/tmp_trt_llm inside the container.
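
Once the container is running, the shared memory setting can be checked and the folder created from Python. The following is a minimal sketch that uses only the standard library. The helper names shared_memory_gib and ensure_dir are hypothetical, and /dev/shm is assumed to be the mount that Docker sizes with --shm-size (the usual location on Linux).

```python
import shutil
from pathlib import Path


def shared_memory_gib(path="/dev/shm"):
    """Return (total, free) space in GiB for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.free / gib


def ensure_dir(path="/opt/checkpoints/tmp_trt_llm"):
    """Create `path` (and any missing parents) if needed and return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory
```

Inside the container, shared_memory_gib() should report a total close to the value passed to --shm-size, and ensure_dir() creates the folder that is later passed as model_dir.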

Use the TensorRT-LLM APIs to export the NeMo checkpoint:


from nemo.export import TensorRTLLM

# Directory where the exported TensorRT-LLM files are stored.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")

# Export the NeMo checkpoint for a single GPU.
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_baichuan2.nemo",
    model_type="baichuan2",
    n_gpus=1,
)

# Run a quick test generation with the exported model.
output = trt_llm_exporter.forward(
    ["test1", "how about test 2"],
    max_output_token=150,
    top_k=1,
    top_p=0.5,
    temperature=0.5,
)
print("output: ", output)

Please check the TensorRTLLM docstrings for details.
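
One way to read those docstrings without opening the source is Python's built-in inspect module. The sketch below is only an illustration: describe is a hypothetical helper, and the commented usage assumes it runs inside the container, where nemo.export is importable.

```python
import inspect


def describe(obj):
    """Return the call signature and docstring of a class, method, or function."""
    try:
        signature = str(inspect.signature(obj))
    except (TypeError, ValueError):
        # Some objects (for example certain built-ins) expose no signature.
        signature = "(signature unavailable)"
    doc = inspect.getdoc(obj) or "(no docstring)"
    return f"{obj.__qualname__}{signature}\n{doc}"


# Inside the container, one might run:
#   from nemo.export import TensorRTLLM
#   print(describe(TensorRTLLM.export))
#   print(describe(TensorRTLLM.forward))
```

The interactive help() built-in gives similar information, for example help(TensorRTLLM) in a Python session.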

