Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Multimodal Models
The NVIDIA NeMo™ Framework offers robust support for multimodal models across four key categories: Multimodal Language Models, Vision-Language Foundation Models, Text-to-Image Models, and Beyond 2D Generation using NeRF. Each category addresses specific needs and advancements in the field, leveraging cutting-edge models to handle a wide range of data types, including text, images, and 3D models.
Multimodal Language Models
Multimodal Language Models enrich language models with multimodal capabilities, primarily by attaching visual encoders, to enable interactive visual and textual understanding; a minimal sketch of this connector pattern follows the list. Supported models include:
NeVA (LLaVA) - Provides training, fine-tuning, and inference capabilities.
VideoNeVA (LLaVA) - Provides training and inference capabilities for the video modality.
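The design shared by these models is a lightweight connector that maps features from a frozen visual encoder into the language model's embedding space, so the LLM can attend over image and text tokens jointly. The sketch below illustrates that pattern in plain PyTorch; every dimension and module here is an illustrative assumption, not NeMo's actual implementation.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only (not NeMo's configuration):
vision_dim = 1024   # hidden size of a frozen vision encoder (e.g., a CLIP ViT)
llm_dim = 4096      # hidden size of the language model
num_patches = 256   # visual tokens produced per image
seq_len = 32        # text tokens in the prompt

image_features = torch.randn(1, num_patches, vision_dim)  # vision encoder output
text_embeddings = torch.randn(1, seq_len, llm_dim)        # LLM input embeddings

# The trainable adapter: projects visual features into the LLM embedding space.
projector = nn.Linear(vision_dim, llm_dim)
visual_tokens = projector(image_features)

# The LLM then attends jointly over visual and text tokens.
multimodal_input = torch.cat([visual_tokens, text_embeddings], dim=1)
print(multimodal_input.shape)  # torch.Size([1, 288, 4096])
```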
Vision-Language Foundation Models
Aiming at models capable of processing and understanding both visual and textual information, Vision-Language Foundation Models can be fine-tuned for vision-related tasks such as classification and clustering. They can also serve as foundational modules in vision-language models, text-to-image models, and more. Supported models include:
CLIP - Offers training and inference capabilities, excelling in zero-shot image classification and similarity scoring (see the scoring sketch after this list).
NSFW Content Filtering Model (fine-tuned on CLIP) - Provides a vision-based filtering solution to identify explicit content.
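Zero-shot classification with CLIP reduces to a similarity computation: the image and a set of candidate label prompts are each embedded, and the label whose text embedding is most similar to the image embedding wins. The sketch below shows that scoring step in plain PyTorch with random stand-in embeddings; it illustrates the math only and is not NeMo's CLIP API.

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings; in practice these come from CLIP's image and text towers.
image_embedding = torch.randn(1, 512)   # one image
text_embeddings = torch.randn(3, 512)   # prompts like "a photo of a {cat, dog, car}"

# CLIP scores are cosine similarities of L2-normalized embeddings,
# scaled by a learned temperature (the logit scale).
image_embedding = F.normalize(image_embedding, dim=-1)
text_embeddings = F.normalize(text_embeddings, dim=-1)
logit_scale = 100.0  # typical value of exp(learned temperature) in CLIP

logits = logit_scale * image_embedding @ text_embeddings.T  # shape (1, 3)
probs = logits.softmax(dim=-1)
print(probs)  # probability assigned to each candidate label
```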
Text-to-Image Models
Text-to-Image Models generate images from textual descriptions, spanning diffusion-based, autoregressive, and masked token prediction approaches; a minimal generation sketch follows the list. Supported models include:
Foundation Models: Stable Diffusion, SDXL, and Imagen
Fine-tuning Models: DreamBooth, ControlNet, and InstructPix2Pix
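For orientation, text-to-image generation is a single pipeline call once a checkpoint is loaded. The sketch below uses the Hugging Face diffusers library rather than NeMo's own training and inference scripts, purely as a familiar illustration; the checkpoint name is an assumption, not a NeMo artifact.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (illustrative choice).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The pipeline encodes the prompt, runs the iterative denoising loop,
# and decodes the latent into an image.
image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```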
Beyond 2D Generation using NeRF
NeMo NeRF focuses on the creation and manipulation of 3D and 4D models through a modular approach. Supported models include:
DreamFusion - Generates detailed 3D objects from text descriptions by using a pre-trained 2D text-to-image diffusion model to optimize a Neural Radiance Field for rendering (see the sketch below).
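DreamFusion's central idea is Score Distillation Sampling (SDS): the pre-trained diffusion model's denoising error on a noised render is pushed back into the scene parameters, so no new generative model is trained. The toy below replaces the NeRF renderer (with a learnable image) and the diffusion U-Net (with a random stub) so that only the update rule is visible; it is a conceptual sketch under those stated simplifications, not NeMo's implementation.

```python
import torch

def predict_noise(noisy_image, t):
    """Hypothetical stand-in for a pretrained text-conditioned diffusion U-Net."""
    return torch.randn_like(noisy_image)

# A learnable image stands in for the differentiable NeRF render.
render = torch.nn.Parameter(torch.rand(1, 3, 64, 64))
opt = torch.optim.Adam([render], lr=1e-2)

for step in range(100):
    t = torch.randint(1, 1000, (1,))
    alpha_bar = 1.0 - t.float() / 1000.0  # simplified noise schedule (no w(t) weighting)
    noise = torch.randn_like(render)
    noisy = alpha_bar.sqrt() * render + (1.0 - alpha_bar).sqrt() * noise
    eps_hat = predict_noise(noisy, t)

    # SDS gradient: (eps_hat - noise), applied to the render without
    # backpropagating through the diffusion model itself.
    sds_grad = (eps_hat - noise).detach()
    loss = (sds_grad * render).sum()  # surrogate loss whose gradient w.r.t. render is sds_grad
    opt.zero_grad()
    loss.backward()
    opt.step()
```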
The comprehensive support for these models across four categories underscores the NeMo Framework’s commitment to advancing multimodal AI development. It provides researchers and developers with the necessary tools to explore the limits of artificial intelligence.