Three steps are required to deploy the model:
Setup
Prior to deploying a model or pipeline, it must be exported following the steps in the Model Export section. No additional setup is required, as the NeMo container ships with the relevant NVIDIA Triton Inference Server libraries preinstalled and ready to go.
Start NVIDIA Triton Inference Server
Starting the NVIDIA Triton Inference Server takes a single command. Before running it, however, read the model-specific sections below to make sure every file is in the correct place. To start the NVIDIA Triton Inference Server:
/opt/tritonserver/bin/tritonserver --log-verbose 2 --model-repository /opt/NeMo-Megatron-Launcher/deployment/server --model-control-mode=explicit --load-model clip_trt
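For reference, each model directory under the repository passed to --model-repository pairs a versioned model file with a config.pbtxt. A minimal sketch for a TensorRT plan model is shown below; the tensor names, data types, and dimensions here are illustrative assumptions only, and the launcher repository ships the actual configuration files:

name: "clip_vision_trt"
platform: "tensorrt_plan"
max_batch_size: 64
input [
  {
    name: "images"          # assumed input tensor name
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]   # assumed CLIP vision input shape
  }
]
output [
  {
    name: "features"        # assumed output tensor name
    data_type: TYPE_FP32
    dims: [ 512 ]           # assumed embedding size
  }
]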
File Copy
Copy the generated .plan file to deployment/server/clip_vision_trt/1/model.plan.
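The copy step can be sketched as follows. Here a placeholder file stands in for the exported engine; in practice, substitute the .plan file produced by the export step:

```shell
# Placeholder standing in for the engine produced by the export step.
touch model.plan

# Triton expects model files under <model_name>/<version>/ in the repository.
mkdir -p deployment/server/clip_vision_trt/1
cp model.plan deployment/server/clip_vision_trt/1/model.plan
```

The `1` directory is Triton's model version; the server loads the highest version present by default.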
Query NVIDIA Triton Inference Server
In a separate instance of the NeMo container, we can set up a client to query the server. An example client is provided in deployment/client/clip_client.py.
Querying clip_trt performs tokenization and automatically calls clip_vision_trt using Triton's Business Logic Scripting (BLS).
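Beyond the provided client script, the server can also be queried over Triton's KServe v2 HTTP protocol. The sketch below builds such a request body with only the standard library; the input tensor name "TEXT" and its datatype are assumptions for illustration, so check deployment/client/clip_client.py for the actual tensor names:

```python
import json

# Build a KServe v2 inference request for the clip_trt BLS model.
# NOTE: the input name "TEXT" and datatype "BYTES" are assumptions;
# the real tensor names are defined by the model's config.pbtxt.
payload = {
    "inputs": [
        {
            "name": "TEXT",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["a photo of a cat"],
        }
    ]
}

body = json.dumps(payload)
# This body would be POSTed to the server started above, e.g.:
#   http://localhost:8000/v2/models/clip_trt/infer
print(body)
```

Because clip_trt handles tokenization via BLS, the client only needs to send raw text; the vision model is invoked server-side.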