Model Deployment

Four steps are required to deploy the model:

  1. Setup

Prior to deploying a model or pipeline, it must be exported by following the steps in the Model Export section. No additional setup is required, as the NeMo container comes with the relevant NVIDIA Triton Inference Server libraries preinstalled and ready to go.

  2. Start NVIDIA Triton Inference Server

Starting the NVIDIA Triton Inference Server takes a single command. Before running it, however, read the model-specific section below to make sure everything is in the correct place. To start the NVIDIA Triton Inference Server:

/opt/tritonserver/bin/tritonserver --log-verbose 2 --model-repository /opt/NeMo-Megatron-Launcher/deployment/server --model-control-mode=explicit --load-model stable_diffusion

  3. File Copy

Copy the generated plan directory to the deployment/server/stable_diffusion/1/ directory.
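The copy step above can be sketched in Python. This is a minimal sketch, not part of the NeMo tooling: the helper name copy_plan and the location of the exported plan directory are assumptions, while the destination layout (deployment/server/stable_diffusion/1/) comes from this guide.

```python
import shutil
from pathlib import Path

def copy_plan(plan_dir, repo_root="deployment/server"):
    """Copy an exported plan directory into the Triton model repository's
    version-1 slot for the stable_diffusion model.

    plan_dir: path to the directory produced by model export (assumption:
    adjust to wherever your export step actually wrote the plan files).
    """
    dest = Path(repo_root) / "stable_diffusion" / "1" / Path(plan_dir).name
    # dirs_exist_ok lets the copy be re-run after a fresh export.
    shutil.copytree(plan_dir, dest, dirs_exist_ok=True)
    return dest
```

An equivalent `cp -r` from the shell works just as well; the point is only that the plan files must land under the model's version directory before Triton loads the model.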

  4. Query NVIDIA Triton Inference Server

In a separate instance of the NeMo container, we can set up a client to query the server. An example client is provided in deployment/client/stable_diffusion_client.py.

At query time, seed, unconditional_guidance_scale, inference_steps, and eta can be passed as optional inputs. If they are not set, the defaults are the values used during export. The server returns a single NumPy array containing num_images_per_prompt images. In the client example, make sure to set the path to the control image.
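The optional-input behavior above can be illustrated with a small sketch that assembles the inputs as NumPy arrays, the way a Triton client would before building a request. The helper name build_optional_inputs and the dtypes are assumptions; check the exported model's configuration (and the example client) for the tensor names and types it actually expects.

```python
import numpy as np

def build_optional_inputs(seed=None, unconditional_guidance_scale=None,
                          inference_steps=None, eta=None):
    """Collect only the optional inputs the caller sets; omitted keys
    fall back to the defaults baked in at export time."""
    inputs = {}
    if seed is not None:
        inputs["seed"] = np.array([seed], dtype=np.int64)
    if unconditional_guidance_scale is not None:
        inputs["unconditional_guidance_scale"] = np.array(
            [unconditional_guidance_scale], dtype=np.float32)
    if inference_steps is not None:
        inputs["inference_steps"] = np.array([inference_steps], dtype=np.int64)
    if eta is not None:
        inputs["eta"] = np.array([eta], dtype=np.float32)
    return inputs

# Example: override only the seed and step count; guidance scale and eta
# keep their export-time defaults because they are simply not sent.
request_inputs = build_optional_inputs(seed=42, inference_steps=50)
```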

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.