Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the NeMo 2.0 overview for information on getting started.

Multimodal Models

The NeMo Framework offers robust support for multimodal models, extending its capabilities across four key categories: Multimodal Language Models, Vision-Language Foundation Models, Text-to-Image Models, and Neural Radiance Fields (NeRF).

Multimodal Language Models

Multimodal Language Models enrich language models with multimodal capabilities, primarily through visual encoders, enabling interactive visual and textual understanding. Supported models include:

  • NeVA (LLaVA) - Provides training, fine-tuning, and inference capabilities.

  • VideoNeVA (LLaVA) - Provides training and inference capabilities for the video modality.
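For orientation, here is a minimal sketch of LLaVA-style visual question answering. It uses the Hugging Face Transformers library rather than NeMo's own NeVA scripts, and the checkpoint name, input image, and prompt format are illustrative assumptions:

```python
# Minimal sketch of LLaVA-style visual question answering, using the
# Hugging Face Transformers library for illustration; NeMo's NeVA models
# are trained and served through NeMo's own scripts, not this API.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # hypothetical input image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```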

Vision-Language Foundation Models

Vision-Language Foundation Models process and understand both visual and textual information. They can be fine-tuned to perform vision-related tasks such as classification and clustering, and they can also serve as foundational modules in vision-language models, text-to-image models, and more. Supported models include:

  • CLIP - Offers training and inference capabilities, excelling in zero-shot image classification and similarity scoring (see the sketch after this list).

  • NSFW Content Filtering Model (fine-tuned on CLIP) - Provides a vision-based filtering solution to identify explicit content.
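As an illustration of CLIP's zero-shot classification, here is a minimal sketch using the Hugging Face Transformers library rather than NeMo's own CLIP pipeline; the checkpoint name, input image, and label set are illustrative assumptions:

```python
# Minimal sketch of zero-shot image classification with CLIP, using the
# Hugging Face Transformers library for illustration; NeMo's CLIP training
# and inference go through its own scripts and configs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into zero-shot class probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The key idea is that CLIP embeds the image and each candidate caption into a shared space, so classification reduces to picking the caption with the highest similarity score.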

Text-to-Image Models

Text-to-Image Models are designed to generate images from textual descriptions, encompassing a range of methodologies from diffusion-based to autoregressive and masked token prediction models. Supported models include:

  • Foundation Models: Stable Diffusion, SDXL, and Imagen

  • Fine-Tuning Models: DreamBooth, ControlNet, InstructPix2Pix
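To make the diffusion-based workflow concrete, here is a minimal text-to-image sketch using the Hugging Face diffusers library rather than NeMo's own pipelines; the checkpoint name and prompt are illustrative assumptions:

```python
# Minimal sketch of text-to-image generation with Stable Diffusion, using
# the Hugging Face diffusers library for illustration; NeMo ships its own
# training and inference pipelines for these models.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed public checkpoint
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lighthouse.png")
```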

Beyond 2D Generation using NeRF

NeMo NeRF concentrates on the creation and manipulation of 3D and 4D models through a modular approach, supporting models such as:

  • DreamFusion: This model generates detailed 3D objects from text descriptions, utilizing pre-trained 2D text-to-image diffusion models and Neural Radiance Fields for rendering.

The comprehensive support for these models across four categories underscores the NeMo Framework’s commitment to advancing multimodal AI development. It provides researchers and developers with the necessary tools to explore the limits of artificial intelligence.
