The NeMo inference container contains modules and scripts that export NeMo LLM models to TensorRT-LLM and deploy them to the NVIDIA Triton Inference Server.
Use the NeMo deploy module to serve a TensorRT-LLM model in Triton:
from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the NeMo checkpoint to a TensorRT-LLM engine.
# model_dir is where the generated engine files will be written.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_chatglm.nemo",
    model_type="chatglm",
    n_gpus=1,
)

# Load the exported model into Triton and start serving.
# serve() blocks until the server is stopped.
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="ChatGLM3-6B", port=8000)
nm.deploy()
nm.serve()
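Once serve() is blocking and the model is live, a separate process can send prompts to the Triton endpoint. Below is a minimal client sketch, assuming the NemoQuery helper from nemo.deploy and the model name used above; the URL, prompt, and keyword argument names are illustrative and may vary across NeMo releases:

```python
from nemo.deploy import NemoQuery

# Connect to the Triton server started above.
# Host and port are assumptions matching the deploy example (port=8000).
nq = NemoQuery(url="localhost:8000", model_name="ChatGLM3-6B")

# Send a prompt and print the generated text.
# The keyword controlling output length differs between NeMo versions.
output = nq.query_llm(prompts=["What is the color of a banana?"], max_output_len=128)
print(output)
```

Run this client in a second terminal while the deployment script keeps the server alive.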