Model Deployment

The NeMo inference container includes modules and scripts for exporting NeMo LLM models to TensorRT-LLM and deploying them to the Triton Inference Server.

Use the NeMo deploy module to serve a TensorRT-LLM model in Triton:


from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the NeMo checkpoint to TensorRT-LLM; the output is written to model_dir.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_baichuan2.nemo",
    model_type="baichuan2",
    n_gpus=1,
)

# Deploy the exported model to Triton under the given model name and serve it on port 8000.
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="BAICHUAN2-7B", port=8000)
nm.deploy()
nm.serve()
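Once the server is running, it can be useful to confirm from a separate process that it is reachable before sending requests. The following is a minimal sketch, not part of NeMo: it assumes the deployment exposes Triton's standard HTTP readiness endpoint (/v2/health/ready) on the port passed to DeployPyTriton (8000 above). The helper names health_url and is_ready are hypothetical and only use the Python standard library.

```python
import http.client
import urllib.error
import urllib.request


def health_url(host: str, port: int) -> str:
    """Build the readiness URL (assumed: Triton's standard HTTP health endpoint)."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host: str = "localhost", port: int = 8000, timeout: float = 2.0) -> bool:
    """Return True if the server answers the readiness probe with HTTP 200."""
    try:
        with urllib.request.urlopen(health_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a malformed reply:
        # treat all of these as "not ready".
        return False
```

Calling is_ready("localhost", 8000) from another shell or script should return True once the server started above is accepting requests, and False otherwise. Adjust the host and port if the deployment differs.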

© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.