The NeMo Framework inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to the Triton Inference Server.
Start the NeMo Framework inference container and mount the GPT checkpoint. Pull and run the container as shown below, replacing vr
with the version of the container you would like to use::
docker run --gpus all -it --rm --shm-size=30g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr
Launch the deploy script to start serving the model:
python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/GPT-2B-001_bf16_tp1.nemo --triton_model_name GPT-2B
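Once the script is running, you can optionally confirm that the server is ready before sending queries. The snippet below is a minimal sketch that checks Triton's standard HTTP readiness endpoint; it assumes the server is reachable on the default HTTP port 8000 and that the requests package is installed:
import requests

# Triton exposes a standard HTTP readiness endpoint; it returns 200 once the
# server and the loaded models are ready to accept inference requests.
resp = requests.get("http://localhost:8000/v2/health/ready", timeout=5.0)
print("Triton ready:", resp.status_code == 200)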
Sending a Query using NeMo APIs
Send queries using NeMo APIs:
from nemo.deploy import NemoQuery
nq = NemoQuery(url="localhost:8000", model_name="GPT-2B")
output = nq.query_llm(prompts=["hello, testing GPT inference", "Did you get what you expected?"], max_output_len=200, top_k=1, top_p=0.0, temperature=0.0)
print(output)
Set the url and model_name parameters based on your server address and the model name of your service. Please check the NemoQuery docstrings for details.
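For example, to query a server running on a different machine, point url at that host. The hostname below is a placeholder; substitute the address of your own Triton server:
# "my-inference-host" is a placeholder; replace it with your server's address.
nq = NemoQuery(url="my-inference-host:8000", model_name="GPT-2B")
output = nq.query_llm(prompts=["hello"], max_output_len=50)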
Sending a Query using PyTriton
Install PyTriton with pip: https://github.com/triton-inference-server/pytriton. Send queries using PyTriton:
from pytriton.client import ModelClient
import numpy as np
def query_llm(url, model_name, prompts, max_output_len, init_timeout=600.0):
    # Encode the prompts as a 2D array of UTF-8 bytes, one prompt per row.
    str_ndarray = np.array(prompts)[..., np.newaxis]
    prompts = np.char.encode(str_ndarray, "utf-8")
    max_output_len = np.full(prompts.shape, max_output_len, dtype=np.int_)
    with ModelClient(url, model_name, init_timeout_s=init_timeout) as client:
        result_dict = client.infer_batch(prompts=prompts, max_output_len=max_output_len)
    # Decode the returned byte tensor back into strings.
    sentences = np.char.decode(result_dict["outputs"].astype("bytes"), "utf-8")
    return sentences
output = query_llm(url="localhost:8000", model_name="GPT-2B", prompts=["Hey, tell me a joke", "How are you today"], max_output_len=150)
print(output)
Use the NeMo export and deploy modules to serve the TensorRT-LLM model in Triton:
from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the NeMo checkpoint to a TensorRT-LLM engine in the given model directory.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm_folder/")
trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/GPT-2B-001_bf16_tp1.nemo", model_type="gptnext", n_gpus=1)

# Serve the exported engine with Triton via PyTriton.
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="GPT-2B", port=8000)
nm.deploy()
nm.serve()
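nm.serve() keeps the Triton server running in the foreground. From a separate process you can then send queries to the served model with the same NemoQuery client shown earlier; this sketch assumes the server is reachable on port 8000:
from nemo.deploy import NemoQuery

# Connects to the Triton server started by the script above.
nq = NemoQuery(url="localhost:8000", model_name="GPT-2B")
output = nq.query_llm(prompts=["What is TensorRT-LLM?"], max_output_len=100)
print(output)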