Model Deployment

The NeMo Framework inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the NVIDIA Triton Inference Server.

Pull and run the NeMo Framework dedicated container for Gemma:

docker run --gpus all -it --rm --shm-size=30g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:24.01.gemma
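The -v ${PWD}:/opt/checkpoints/ flag mounts the current directory, which should contain the .nemo checkpoint, into the container, and -p 8000:8000 publishes the port that the Triton server will listen on.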

Launch the deploy script to start serving the model:

python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/megatron_gemma.nemo --triton_model_name GEMMA

The deploy script exports the checkpoint given by --nemo_checkpoint to TensorRT-LLM and deploys it to the Triton Inference Server.
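Once the server reports that it is ready, it listens on port 8000, the port published in the docker command above, and can be queried with either of the clients described below.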

Sending a Query using NeMo APIs

Send queries using NeMo APIs:

from nemo.deploy import NemoQuery

nq = NemoQuery(url="localhost:8000", model_name="GEMMA")
output = nq.query_llm(
    prompts=["hello, testing GEMMA inference", "Did you get what you expected?"],
    max_output_len=200,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print(output)

Set url and model_name based on your server address and the model name of your service. Please check the NemoQuery docstrings for details.
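The same query_llm call accepts the sampling arguments shown above, so you can switch between greedy and sampled decoding per request. A minimal sketch, assuming the server from the previous step is still running; the prompt and sampling values are illustrative only, not recommendations:

from nemo.deploy import NemoQuery

nq = NemoQuery(url="localhost:8000", model_name="GEMMA")

# Greedy decoding: top_k=1 with zero temperature always picks the most likely token.
greedy = nq.query_llm(
    prompts=["Write a short poem about GPUs"],
    max_output_len=100,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)

# Sampled decoding: larger top_k/top_p and a higher temperature give more varied outputs.
sampled = nq.query_llm(
    prompts=["Write a short poem about GPUs"],
    max_output_len=100,
    top_k=40,
    top_p=0.95,
    temperature=0.8,
)

print(greedy)
print(sampled)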

Sending a Query using PyTriton

Install PyTriton with pip (pip install nvidia-pytriton; see https://github.com/triton-inference-server/pytriton for details). Send queries using PyTriton:

from pytriton.client import ModelClient
import numpy as np


def query_llm(url, model_name, prompts, max_output_len, init_timeout=600.0):
    str_ndarray = np.array(prompts)[..., np.newaxis]
    prompts = np.char.encode(str_ndarray, "utf-8")
    max_output_len = np.full(prompts.shape, max_output_len, dtype=np.int_)
    with ModelClient(url, model_name, init_timeout_s=init_timeout) as client:
        result_dict = client.infer_batch(prompts=prompts, max_output_len=max_output_len)

    sentences = np.char.decode(result_dict["outputs"].astype("bytes"), "utf-8")
    return sentences


output = query_llm(
    url="localhost:8000",
    model_name="GEMMA",
    prompts=["Hey, tell me a joke", "How are you today"],
    max_output_len=150,
)
print(output)
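Note that the client encodes the prompt strings as UTF-8 byte arrays because Triton passes string tensors as bytes; the np.char.decode call reverses this on the returned outputs.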

Use the NeMo deploy module to serve the TensorRT-LLM model in Triton:

from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm_folder/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_gemma.nemo",
    model_type="gemma",
    n_gpus=1,
)
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="GEMMA", port=8000)
nm.deploy()
nm.serve()
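Once nm.serve() is running, query the model from a separate process or terminal using either of the client examples above.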
