Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Deploy NeMo Multimodal Models
Optimized Inference for Multimodal Models using TensorRT
For scenarios requiring optimized performance, NeMo multimodal models can leverage TensorRT. This process involves converting NeMo models into a format compatible with TensorRT using the nemo.export module.
Supported GPUs
TensorRT-LLM supports NVIDIA DGX H100 and NVIDIA H100 GPUs, as well as GPUs based on the NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Ampere, and NVIDIA Turing architectures.
Supported Models
The following table shows the supported models.
Model Name | NeMo Precision | TensorRT Precision
---|---|---
Neva | bfloat16 | bfloat16
Video Neva | bfloat16 | bfloat16
LITA/VITA | bfloat16 | bfloat16
VILA | bfloat16 | bfloat16
SALM | bfloat16 | bfloat16
Access the Models with a Hugging Face Token
If you want to run inference using the Llama 3 model, you’ll need to generate a Hugging Face token that has access to the model. Visit Hugging Face for more information. After you have the token, perform one of the following steps.
Log in to Hugging Face:
huggingface-cli login
Or, set the HF_TOKEN environment variable:
export HF_TOKEN=your_token_here
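If you prefer to authenticate from Python rather than the CLI, the huggingface_hub library provides an equivalent login call. This is a minimal sketch that reads the token from the HF_TOKEN environment variable set above:

# Minimal sketch: programmatic alternative to `huggingface-cli login`.
# Reads the token from the HF_TOKEN environment variable exported above.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])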
Export and Deploy a NeMo Multimodal Checkpoint to TensorRT
This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT. Neva will be used as an example model. Please consult the table above for a complete list of supported models.
Run the Docker container image using the commands shown below. Change the :vr tag to the version of the container you want to use:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v /path/to/nemo_neva.nemo:/opt/checkpoints/nemo_neva.nemo -w /opt/NeMo nvcr.io/nvidia/nemo:vr
Run the following deployment script to verify that everything is working correctly. The script exports the downloaded NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:
python scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_neva.nemo --model_type neva --llm_model_type llama --triton_model_name neva --modality vision
If you only want to export the NeMo checkpoint to TensorRT, use the examples/multimodal/multimodal_llm/neva/neva_export.py script.

If the test yields a shared memory-related error, increase the shared memory size using --shm-size.

In a separate terminal, run the following command to get the container ID of the running container. Look for the nvcr.io/nvidia/nemo:vr image to find the container ID.

docker ps

Access the running container, replacing container_id with the actual container ID:

docker exec -it container_id bash
To send a query to the Triton server, run the following script:
python scripts/deploy/multimodal/query.py -mn neva -mt=neva -int="What is in this image?" -im=/path/to/image.jpg
To export and deploy a different model, such as Video Neva, change the model_type and modality arguments passed to the scripts/deploy/multimodal/deploy_triton.py script.
Use a Script to Run Inference on a Triton Server
You can deploy a multimodal model from a NeMo checkpoint on Triton using the provided script. This deployment uses TensorRT to achieve optimized inference.
Export and Deploy a Multimodal Model to TensorRT
The script first exports the model to TensorRT and then starts the service on Triton.
Start the container using the steps described in the previous section.
To begin serving the model, run the following script:
python scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_neva.nemo --model_type neva --llm_model_type llama --triton_model_name neva
The following parameters are defined in the deploy_triton.py script:

modality - modality of the model. choices=["vision", "audio"]. By default, it is set to "vision".
visual_checkpoint - path to the .nemo file of the visual model, or the path to the perception model checkpoint for the SALM model.
llm_checkpoint - path to the .nemo file of the LLM. Set to visual_checkpoint if not provided.
model_type - type of the model. choices=["neva", "video-neva", "lita", "vila", "vita", "salm"].
llm_model_type - type of the LLM. choices=["gptnext", "gpt", "llama", "falcon", "starcoder", "mixtral", "gemma"].
triton_model_name - name of the model on Triton.
triton_model_version - version of the model. Default is 1.
triton_port - port for the Triton server to listen on for requests. Default is 8000.
triton_http_address - HTTP address for the Triton server. Default is 0.0.0.0.
triton_model_repository - TensorRT temporary folder. Default is /tmp/trt_model_dir/.
num_gpus - number of GPUs to use for inference. Large models require multi-GPU export.
dtype - data type of the model on TensorRT-LLM. Default is "bfloat16". Currently, only "bfloat16" is supported.
max_input_len - maximum input length of the model.
max_output_len - maximum output length of the model.
max_batch_size - maximum batch size of the model.
max_multimodal_len - maximum length of the multimodal input.
vision_max_batch_size - maximum batch size of input images for the vision encoder. Default is 1. For models like LITA and VITA on video inference, this should be set to 256.
Note
The parameters described here are generalized and should be compatible with any NeMo checkpoint. It is important, however, that you check the multimodal model table above for optimized inference model compatibility. We are actively working on extending support to other checkpoints.
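For reference, the same limits can also be set when exporting through the Python API described later in this document. The sketch below assumes that the export() keyword arguments mirror the CLI flags listed above; the values are illustrative only, so check the TensorRTMMExporter.export docstrings before relying on them:

# Minimal sketch: exporting with explicit size limits via the Python API.
# Assumes the export() keywords mirror the deploy_triton.py flags above (an assumption, not confirmed here).
from nemo.export.tensorrt_mm_exporter import TensorRTMMExporter

exporter = TensorRTMMExporter(model_dir="/tmp/trt_model_dir/", modality="vision")
exporter.export(
    visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo",
    model_type="neva",
    llm_model_type="llama",
    tensor_parallel_size=1,
    max_input_len=256,        # illustrative values; tune for your workload
    max_output_len=256,
    max_batch_size=1,
    max_multimodal_len=3072,
    vision_max_batch_size=1,
    dtype="bfloat16",
)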
Whenever the script is executed, it initiates the service by exporting the NeMo checkpoint to TensorRT. If you want to skip the export step in the optimized inference option, you can specify an empty directory.
To export and deploy a different model, such as Video Neva, change the model_type and modality arguments passed to the scripts/deploy/multimodal/deploy_triton.py script. Please see the table below to learn which model_type and modality to use for each multimodal model.
Model Name | model_type | modality
---|---|---
Neva | neva | vision
Video Neva | video-neva | vision
LITA | lita | vision
VILA | vila | vision
VITA | vita | vision
SALM | salm | audio
Stop the running container and then run the following command to specify an empty directory:
mkdir tmp_triton_model_repository

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr python scripts/deploy/multimodal/deploy_triton.py --visual_checkpoint /opt/checkpoints/nemo_neva.nemo --model_type neva --llm_model_type llama --triton_model_name neva --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --modality vision
The checkpoint will be exported to the specified folder after executing the script mentioned above.
To load the exported model directly, run the following script within the container:
python scripts/deploy/multimodal/deploy_triton.py --triton_model_name neva --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type neva --llm_model_type llama --modality vision
Send a Query
After you start the service using the scripts from the previous section, the service waits for incoming requests. You can send a query to this service in several ways.
Use the Query Script: Execute the query script within the currently running container.
PyTriton: Utilize PyTriton to send requests directly.
HTTP Requests: Make HTTP requests using various tools or libraries.
The following example shows how to execute the query script within the currently running container.
To use the query script, run the following command. For VILA/LITA/VITA models, prepend <image>\n to the input text, for example, <image>\n What is in this image?:

python scripts/deploy/multimodal/query.py --url "http://localhost:8000" --model_name neva --model_type neva --input_text "What is in this image?" --input_media /path/to/image.jpg
Change the url and the model_name based on your server and the model name of your service. The code in the script can be used as a basis for your client code as well. input_media is the path to the image or audio file you want to use as input.
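If you choose the HTTP Requests option, the Triton server also exposes the standard KServe v2 HTTP endpoints on the configured port. The sketch below only probes server and model readiness and fetches model metadata; it assumes the server runs on localhost:8000, that the model was deployed under the name neva, and that the requests package is installed. For the full inference payload, the query script above is the reference client.

# Minimal sketch: probing the Triton server over HTTP via the standard KServe v2 endpoints.
# Assumes localhost:8000 and a deployed model named "neva".
import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/v2/health/ready").status_code)       # 200 when the server is ready
print(requests.get(f"{base}/v2/models/neva/ready").status_code)  # 200 when the model is ready
print(requests.get(f"{base}/v2/models/neva").json())             # model metadata (inputs and outputs)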
Use NeMo Export and Deploy Module APIs to Run Inference
Up until now, we’ve used scripts for exporting and deploying Multimodal models. However, NeMo’s Deploy and Export modules offer straightforward APIs for deploying models to Triton and exporting NeMo checkpoints to TensorRT.
Export a Multimodal Model to TensorRT
You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the nemo_neva.nemo checkpoint has already been mounted to the /opt/checkpoints/ path. Additionally, /opt/data/image.jpg is assumed to exist.
Run the following command:
from nemo.export.tensorrt_mm_exporter import TensorRTMMExporter

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo", model_type="neva", llm_model_type="llama", tensor_parallel_size=1)
output = exporter.forward("What is in this image?", "/opt/data/image.jpg", max_output_token=30, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)
Be sure to check the TensorRTMMExporter class docstrings for details.
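The same API can export the other supported model types; only the model_type (and, for SALM, the modality) changes. The sketch below shows a Video Neva export under the assumption that the underlying LLM is a Llama model; the checkpoint path is a placeholder:

# Minimal sketch: exporting a Video Neva checkpoint with the same exporter API.
# The checkpoint filename is a placeholder, and llm_model_type="llama" is an assumption about the underlying LLM.
from nemo.export.tensorrt_mm_exporter import TensorRTMMExporter

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(
    visual_checkpoint_path="/opt/checkpoints/nemo_video_neva.nemo",
    model_type="video-neva",
    llm_model_type="llama",
    tensor_parallel_size=1,
)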
Deploy a Multimodal Model to TensorRT
You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the nemo_neva.nemo checkpoint has already been mounted to the /opt/checkpoints/ path.
Run the following command:
from nemo.export.tensorrt_mm_exporter import TensorRTMMExporter
from nemo.deploy import DeployPyTriton

exporter = TensorRTMMExporter(model_dir="/opt/checkpoints/tmp_triton_model_repository/", modality="vision")
exporter.export(visual_checkpoint_path="/opt/checkpoints/nemo_neva.nemo", model_type="neva", llm_model_type="llama", tensor_parallel_size=1)
nm = DeployPyTriton(model=exporter, triton_model_name="neva", port=8000)
nm.deploy()
nm.serve()
Send a Query
The NeMo Framework provides NemoQueryMultimodal APIs to send a query to the Triton server for convenience. These APIs are only accessible from the NeMo Framework container.
To run the request example using NeMo APIs, run the following command:
from nemo.deploy.multimodal import NemoQueryMultimodal

nq = NemoQueryMultimodal(url="localhost:8000", model_name="neva", model_type="neva")
output = nq.query(input_text="What is in this image?", input_media="/opt/data/image.jpg", max_output_token=30, top_k=1, top_p=0.0, temperature=1.0)
print(output)
Change the url and the model_name based on your server and the model name of your service. Please check the NemoQueryMultimodal docstrings for details.
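For audio models such as SALM, the same query API is assumed to work with input_media pointing to an audio file instead of an image; the model name and file path below are placeholders:

# Minimal sketch: querying an audio model (SALM) with the same NemoQueryMultimodal API.
# Assumes a SALM model deployed under the name "salm"; the audio path is a placeholder.
from nemo.deploy.multimodal import NemoQueryMultimodal

nq = NemoQueryMultimodal(url="localhost:8000", model_name="salm", model_type="salm")
output = nq.query(input_text="What is said in this audio clip?", input_media="/opt/data/audio.wav", max_output_token=30, top_k=1, top_p=0.0, temperature=1.0)
print(output)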
Other Examples
For a complete guide to exporting a speech language model such as SALM (to obtain the perception model and merge LoRA weights), please refer to the document here.