Multimodal Models

The NeMo Framework offers robust support for multimodal models across four key categories: Multimodal Language Models, Vision-Language Foundation, Text to Image Models, and Beyond 2D Generation using NeRF. Each category targets specific advancements in the field, leveraging state-of-the-art models to process a wide array of data types, including text, images, and 3D models.

Multimodal Language Models

This category focuses on enriching language models with multimodal capabilities, primarily through visual encoders, to create models that enable interactive visual and textual understanding. Supported models include:

  • NeVA (LLaVA): Provides training, fine-tuning, and inference capabilities.
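Architecturally, LLaVA-style models connect a frozen vision encoder to a language model: patch embeddings from the encoder are mapped by a learned projection into the LLM's embedding space and prepended to the text token embeddings. The sketch below illustrates that wiring with stand-in numpy arrays; all dimensions and variable names are illustrative, not the NeMo API.

```python
import numpy as np

rng = np.random.default_rng(0)

d_vision, d_model = 1024, 4096   # illustrative dims: vision feature size, LLM hidden size
n_patches, n_text = 256, 32      # visual tokens from the encoder, text tokens

# Frozen vision encoder output: one embedding per image patch (stand-in values).
patch_features = rng.standard_normal((n_patches, d_vision))

# Learned projection into the LLM embedding space (LLaVA v1 uses a single
# linear layer; later variants use a small MLP).
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02
visual_tokens = patch_features @ W_proj            # (n_patches, d_model)

# Text token embeddings from the LLM's embedding table (stand-in values).
text_tokens = rng.standard_normal((n_text, d_model))

# The LLM consumes visual tokens followed by text tokens as one sequence.
llm_input = np.concatenate([visual_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (288, 4096)
```

During fine-tuning, typically only the projection (and optionally the LLM) is trained while the vision encoder stays frozen.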

Vision-Language Foundation

These models can process and understand both visual and textual information, and can be fine-tuned for vision tasks such as classification and clustering. Furthermore, they can act as foundational modules in vision-language models, text-to-image models, and more.
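A CLIP-style foundation model, for example, enables zero-shot classification by embedding an image and candidate class captions into a shared space and ranking captions by cosine similarity. The sketch below shows that scoring logic with stand-in embeddings; it is illustrative only, not the NeMo API.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=0.01):
    """Cosine-similarity logits between one image and candidate captions."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img / temperature

rng = np.random.default_rng(0)
captions = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.standard_normal((3, 512))               # stand-in text embeddings
image_emb = text_embs[1] + 0.1 * rng.standard_normal(512)  # image near the "dog" caption

scores = zero_shot_scores(image_emb, text_embs)
print(captions[int(np.argmax(scores))])  # a photo of a dog
```

The same image and text embeddings can also be reused directly for clustering or retrieval, which is what makes these encoders useful as foundational modules.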

Text to Image Models

These models are designed to generate images from textual descriptions, encompassing a range of methodologies from diffusion-based to autoregressive and masked token prediction models.
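In the diffusion-based family, generation works by iteratively denoising: starting from pure noise, a trained model predicts the noise in the current sample and a reverse step removes part of it. A minimal numpy sketch of one standard DDPM reverse step (with a random stand-in for the model's noise prediction) illustrates the mechanics:

```python
import numpy as np

def ddpm_step(x_t, eps_hat, t, betas, rng):
    """One DDPM reverse (denoising) step: x_t -> x_{t-1}."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    coef = (1 - alphas[t]) / np.sqrt(1 - alpha_bar[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # final step is deterministic
    noise = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)    # standard linear noise schedule
x = rng.standard_normal((4, 4))          # stand-in "noisy image"
eps_hat = rng.standard_normal((4, 4))    # a trained model would predict this noise
x_prev = ddpm_step(x, eps_hat, t=999, betas=betas, rng=rng)
print(x_prev.shape)  # (4, 4)
```

Running this step from t = T−1 down to 0, with a real noise-prediction network in place of `eps_hat`, is the core of diffusion-based image generation.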

Beyond 2D generation using NeRF

NeMo NeRF concentrates on the creation and manipulation of 3D and 4D models through a modular approach, supporting innovative models like:

  • DreamFusion: This model generates detailed 3D objects from text descriptions, utilizing pre-trained 2D text-to-image diffusion models and Neural Radiance Fields for rendering.
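The key mechanism in DreamFusion is score distillation sampling (SDS): images rendered from the NeRF are pushed toward the frozen 2D diffusion prior. Sketching the gradient from the DreamFusion paper, with g(θ) the NeRF renderer, ε̂_φ the diffusion model's noise prediction conditioned on the text prompt y, and w(t) a timestep weighting:

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi, \mathbf{x} = g(\theta)\big)
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t; y, t) - \epsilon\big)\,
      \frac{\partial \mathbf{x}}{\partial \theta}
    \right]
```

Optimizing the NeRF parameters θ with this gradient distills the 2D diffusion prior into a consistent 3D object, with no 3D training data required.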

The comprehensive support for these models across four categories underscores the NeMo Framework’s commitment to advancing multimodal AI development. It provides researchers and developers with the necessary tools to explore the limits of artificial intelligence.


© Copyright 2023-2024, NVIDIA. Last updated on Apr 25, 2024.