Export and Deploy Multimodal Models#

The Export-Deploy library provides tools and APIs for exporting and deploying Multimodal Models (MMs) to production environments. It supports multiple checkpoint formats and offers several deployment paths, including TensorRT-LLM deployment through NVIDIA Triton Inference Server.

Overview#

The Export-Deploy library converts MMs from a variety of checkpoint formats into optimized inference engines and supports single-GPU, multi-GPU, and multi-node deployments. Whether you’re working with NeMo 2.0 checkpoints, Megatron Bridge checkpoints, Hugging Face models, or other formats, the library provides unified APIs for model export and deployment, as sketched below.
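
The end-to-end flow is export first, then deploy. The sketch below follows NeMo Framework conventions (`nemo.export` / `nemo.deploy`, `TensorRTMMExporter`, `DeployPyTriton`); treat the exact import paths, arguments, and the example file paths as assumptions to verify against the API reference for your installed version.

```python
# Sketch of the unified export-then-deploy flow. Import paths and class names
# follow NeMo Framework conventions and are assumptions here; check the
# Export-Deploy API reference for your version.
from nemo.deploy import DeployPyTriton
from nemo.export.tensorrt_mm_exporter import TensorRTMMExporter

# Step 1: export the multimodal checkpoint to a TensorRT-LLM engine.
exporter = TensorRTMMExporter(model_dir="/tmp/mm_engine")  # engine output dir
exporter.export(
    visual_checkpoint_path="/models/mm_checkpoint",  # hypothetical checkpoint path
    model_type="neva",        # assumed model-type identifier for the exporter
    tensor_parallel_size=1,   # single GPU; increase for multi-GPU deployments
)

# Step 2: serve the exported engine through Triton.
server = DeployPyTriton(model=exporter, triton_model_name="mm_model", port=8000)
server.deploy()  # load the model onto the Triton server
server.serve()   # block and serve requests until interrupted
```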

Supported Model/Checkpoint Formats#

The library supports several checkpoint formats, each with specific capabilities and deployment options:

NeMo 2.0 Model/Checkpoints#

NeMo 2.0 is the current checkpoint format of the NeMo Framework. It stores all model-related files in a directory structure rather than a single archive file. Note that the NeMo 2.0 format will be deprecated soon in favor of Megatron Bridge.

Supported Export and Deployment Paths:

  • Model deployment with Triton

  • TensorRT-LLM export and deployment with Triton (a client-side query sketch follows this list)
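
Once a NeMo 2.0 checkpoint is deployed behind Triton, the endpoint can be queried from a client. The class and parameter names below (`NemoQueryMultimodal`, `input_text`, `input_media`) are modeled on NeMo’s deploy utilities and should be verified against your installed version.

```python
# Sketch: query a multimodal model that was deployed with Triton.
# Class and argument names are assumptions modeled on NeMo's query utilities.
from nemo.deploy.multimodal import NemoQueryMultimodal

nq = NemoQueryMultimodal(
    url="localhost:8000",   # Triton endpoint from the deployment step
    model_name="mm_model",  # must match triton_model_name used at deploy time
    model_type="neva",      # assumed model-type identifier
)

output = nq.query(
    input_text="What is in this image?",
    input_media="/data/example.jpg",  # hypothetical local image path
    max_output_len=64,
)
print(output)
```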

Megatron Bridge Model/Checkpoints#

Megatron Bridge is designed as the successor to NeMo 2.0 and will eventually replace it.

Export and Deployment Paths Coming Soon:

  • Model deployment with Triton and Ray Serve

  • TensorRT-LLM export and deployment with Triton and Ray Serve

AutoModel Model/Checkpoints#

AutoModel checkpoints are Hugging Face-compatible checkpoints produced by NeMo AutoModel workflows, which provide a simplified interface for working with pre-trained models; a loading sketch follows the list below.

Export and Deployment Paths Coming Soon:

  • Model deployment with Triton

  • TensorRT-LLM export and deployment with Triton
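
Because AutoModel checkpoints are Hugging Face-compatible, they can be loaded directly with the Transformers auto classes while the Triton paths above are pending. A minimal sketch, assuming a hypothetical checkpoint directory and an image-text-to-text model family (the correct auto class depends on the model):

```python
# Sketch: load an AutoModel checkpoint with Hugging Face Transformers.
# The checkpoint path is hypothetical; the auto class depends on the model family.
from transformers import AutoProcessor, AutoModelForImageTextToText

ckpt = "/results/automodel_checkpoint"  # hypothetical AutoModel output directory
processor = AutoProcessor.from_pretrained(ckpt)
model = AutoModelForImageTextToText.from_pretrained(ckpt, device_map="auto")
```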