The NeMo inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.
Use the NeMo deploy module to serve a TensorRT-LLM model in Triton:
from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the NeMo checkpoint to a TensorRT-LLM engine in the given model directory
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm_folder/")
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/megatron_falcon.nemo", model_type="falcon", n_gpus=1)

# Deploy the exported model to Triton and start serving requests
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="FALCON-7B", port=8000)
nm.deploy()
nm.serve()
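
Once the server is running, you can send prompts to the deployed model. The snippet below is a minimal sketch that assumes the NemoQuery helper in the nemo.deploy module and its query_llm parameters (max_output_len, top_k, top_p, temperature) match the version shipped in your container; check the container's documentation if the names differ.

from nemo.deploy import NemoQuery

# Connect to the Triton server started above (assumes it is reachable on localhost:8000)
nq = NemoQuery(url="localhost:8000", model_name="FALCON-7B")

# Send a prompt and print the generated text
output = nq.query_llm(prompts=["What is the color of a banana?"], max_output_len=128, top_k=1, top_p=0.0, temperature=1.0)
print(output)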