The NeMo Framework inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.
Start the NeMo Framework inference container and mount the GPT checkpoint. Pull and run the container as shown below, replacing vr
with the version of the container you would like to use::
docker run --gpus all -it --rm --shm-size=30g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
Launch the deploy script to start serving the model:
python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/GPT-2B-001_bf16_tp1.nemo --triton_model_name GPT-2B
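Once the script is running, you can optionally confirm that the server is ready before sending queries. The snippet below is a minimal sketch that checks Triton's standard HTTP readiness endpoint; it assumes the server is reachable on the default HTTP port 8000 and that the requests package is installed:
import requests

# Triton exposes a standard HTTP readiness endpoint; it returns 200 once the
# server and the loaded models are ready to accept inference requests.
resp = requests.get("http://localhost:8000/v2/health/ready", timeout=5.0)
print("Triton ready:", resp.status_code == 200)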
Sending a Query using NeMo APIs
Send queries using NeMo APIs:
from nemo.deploy import NemoQuery
nq = NemoQuery(url="localhost:8000", model_name="GPT-2B")
output = nq.query_llm(prompts=["hello, testing GPT inference", "Did you get what you expected?"], max_output_len=200, top_k=1, top_p=0.0, temperature=0.0)
print(output)
Set the url and model_name parameters based on your server address and the model name of your service. Please check the NemoQuery docstrings for details.
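For example, to query a server running on a different machine, point url at that host. The hostname below is a placeholder; substitute the address of your own Triton server:
# "my-inference-host" is a placeholder; replace it with your server's address.
nq = NemoQuery(url="my-inference-host:8000", model_name="GPT-2B")
output = nq.query_llm(prompts=["hello"], max_output_len=50)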
Sending a Query using PyTriton
Install PyTriton with pip: https://github.com/triton-inference-server/pytriton. Send queries using PyTriton:
from pytriton.client import ModelClient
import numpy as np
def query_llm(url, model_name, prompts, max_output_len, init_timeout=600.0):
    # Encode the prompts as a 2D array of UTF-8 bytes, one prompt per row.
    str_ndarray = np.array(prompts)[..., np.newaxis]
    prompts = np.char.encode(str_ndarray, "utf-8")
    max_output_len = np.full(prompts.shape, max_output_len, dtype=np.int_)
    with ModelClient(url, model_name, init_timeout_s=init_timeout) as client:
        result_dict = client.infer_batch(prompts=prompts, max_output_len=max_output_len)
    # Decode the returned byte tensor back into strings.
    sentences = np.char.decode(result_dict["outputs"].astype("bytes"), "utf-8")
    return sentences
output = query_llm(url="localhost:8000", model_name="GPT-2B", prompts=["Hey, tell me a joke", "How are you today"], max_output_len=150)
print(output)
Use the NeMo export and deploy modules to serve the TensorRT-LLM model in Triton:
from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the NeMo checkpoint to a TensorRT-LLM engine in the given model directory.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm_folder/")
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo", model_type="gptnext", n_gpus=1)

# Serve the exported engine with Triton via PyTriton.
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="GPT-2B", port=8000)
nm.deploy()
nm.serve()
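nm.serve() keeps the Triton server running in the foreground. From a separate process you can then send queries to the served model with the same NemoQuery client shown earlier; this sketch assumes the server is reachable on port 8000:
from nemo.deploy import NemoQuery

# Connects to the Triton server started by the script above.
nq = NemoQuery(url="localhost:8000", model_name="GPT-2B")
output = nq.query_llm(prompts=["What is TensorRT-LLM?"], max_output_len=100)
print(output)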