Model Deployment

The NeMo inference container contains modules and scripts that help export NeMo LLM models to TensorRT-LLM and deploy them to Triton Inference Server.

Start the NeMo Framework inference container and mount the Llama checkpoint. Pull and run the container as shown below, replacing vr with the version of the container you would like to use:

docker run --gpus all -it --rm --shm-size=30g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:vr

Launch the deploy script to start serving the model:

python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/megatron_llama.nemo --triton_model_name LLAMA-7B
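
Before sending queries, it can help to confirm that the server has finished loading the model. The sketch below polls Triton's standard HTTP readiness endpoint (/v2/health/ready). It assumes the server is reachable over HTTP on the port mapped in the docker command (8000); the helper name wait_until_ready and its parameters are illustrative and are not part of NeMo or Triton.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url="http://localhost:8000", timeout_s=600.0, poll_s=5.0):
    """Poll the readiness endpoint until it returns HTTP 200 or time runs out.

    Hypothetical helper for illustration; not a NeMo or Triton API.
    """
    deadline = time.monotonic() + timeout_s
    ready_url = base_url.rstrip("/") + "/v2/health/ready"
    while True:
        try:
            with urllib.request.urlopen(ready_url, timeout=poll_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server not ready yet: connection refused, timeout, or error status.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Example call (requires a running server):
# print("server ready:", wait_until_ready(timeout_s=30.0))
```

The function returns False rather than raising when the deadline passes, so the caller can decide whether to retry or abort.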

Sending a Query using NeMo APIs

Send queries using NeMo APIs:

from nemo.deploy import NemoQuery

nq = NemoQuery(url="localhost:8000", model_name="LLAMA-7B")
output = nq.query_llm(
    prompts=["hello, testing LLAMA inference", "Did you get what you expected?"],
    max_output_len=200,
    top_k=1,
    top_p=0.0,
    temperature=0.0,
)
print(output)

Set url to the address of your server and model_name to the name the model is served under. See the NemoQuery docstrings for details.
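
When there are many prompts, one option is to send them in fixed-size batches. The sketch below relies only on the query_llm keyword arguments used in the example for this section. The helper query_in_batches, its batch_size parameter, and the stand-in fake_query_llm are hypothetical names introduced for illustration; the query function is passed in so the helper can be tried without a running server.

```python
def query_in_batches(query_fn, prompts, batch_size=8, **query_kwargs):
    """Call query_fn on consecutive slices of prompts and collect each result.

    Hypothetical helper: query_fn is expected to accept prompts=... plus
    keyword arguments such as max_output_len, top_k, top_p, and temperature.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(prompts), batch_size):
        batch = prompts[start:start + batch_size]
        results.append(query_fn(prompts=batch, **query_kwargs))
    return results


# Stand-in for a real query function so the sketch runs without a server.
def fake_query_llm(prompts, max_output_len, **_):
    return [p[:max_output_len] for p in prompts]


outputs = query_in_batches(fake_query_llm, ["a", "b", "c"], batch_size=2, max_output_len=200)
print(outputs)
```

With a live server, the query_llm method of a NemoQuery instance could be passed as query_fn in place of the stand-in.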

Sending a Query using PyTriton

Install PyTriton with pip (see https://github.com/triton-inference-server/pytriton), then send queries using PyTriton:

from pytriton.client import ModelClient
import numpy as np


def query_llm(url, model_name, prompts, max_output_len, init_timeout=600.0):
    # Shape the prompts as a (batch, 1) array and encode them as UTF-8 bytes.
    str_ndarray = np.array(prompts)[..., np.newaxis]
    prompts = np.char.encode(str_ndarray, "utf-8")
    # Provide one max_output_len value per prompt, matching the prompt shape.
    max_output_len = np.full(prompts.shape, max_output_len, dtype=np.int_)

    with ModelClient(url, model_name, init_timeout_s=init_timeout) as client:
        result_dict = client.infer_batch(prompts=prompts, max_output_len=max_output_len)
        # Decode the returned bytes back into Python strings.
        sentences = np.char.decode(result_dict["outputs"].astype("bytes"), "utf-8")
        return sentences


output = query_llm(
    url="localhost:8000",
    model_name="LLAMA-7B",
    prompts=["Hey, tell me a joke", "How are you today"],
    max_output_len=150,
)
print(output)
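
The array handling in query_llm is easier to follow in isolation. The following sketch needs only NumPy and no server. It shows the shapes and dtypes prepared before the infer_batch call, and the decoding step applied to the result; the variable names encoded and decoded are illustrative.

```python
import numpy as np

prompts = ["Hey, tell me a joke", "How are you today"]

# Add a trailing axis so each prompt is its own row: shape (batch, 1).
str_ndarray = np.array(prompts)[..., np.newaxis]
encoded = np.char.encode(str_ndarray, "utf-8")

# One max_output_len value per prompt, matching the prompt array's shape.
max_output_len = np.full(encoded.shape, 150, dtype=np.int_)

print(encoded.shape, encoded.dtype.kind)
print(max_output_len.shape)

# Decoding reverses the encoding step, as done for the "outputs" tensor.
decoded = np.char.decode(encoded.astype("bytes"), "utf-8")
print(decoded[:, 0].tolist())
```

Both inputs share the same two-dimensional shape, with one row per prompt.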

Use the NeMo deploy module to serve a TensorRT-LLM model in Triton:

from nemo.export import TensorRTLLM
from nemo.deploy import DeployPyTriton

# Export the NeMo checkpoint to TensorRT-LLM.
trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_trt_llm_folder/")
trt_llm_exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/megatron_llama.nemo",
    model_type="llama",
    n_gpus=1,
)

# Deploy the exported model to Triton and start serving it.
nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="LLAMA-7B", port=8000)
nm.deploy()
nm.serve()

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.