Multimodal Models | NVIDIA NeMo AutoModel

Introduction

The models in this section combine understanding and generation across text and visual modalities. Beyond the standard image-text-to-text fine-tuning path, these model families may use custom training recipes, packed multimodal datasets, or task-specific model wrappers.

Supported Models

Owner	Model	Architectures
ByteDance Seed	BAGEL	`BagelForUnifiedMultimodal`, `BagelForConditionalGeneration`
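
As an illustration of how the architecture names in the table might be used, the sketch below checks whether a checkpoint's configuration lists one of them. It assumes the checkpoint ships a Hugging Face style config with an `architectures` list. The helper `find_supported_model` and the example config contents are hypothetical and are not part of NeMo AutoModel.

```python
import json
from typing import Optional

# Architecture names copied from the Supported Models table.
SUPPORTED_ARCHITECTURES = {
    "BagelForUnifiedMultimodal": "BAGEL",
    "BagelForConditionalGeneration": "BAGEL",
}


def find_supported_model(config: dict) -> Optional[str]:
    """Return the model name if the config lists a supported architecture.

    Hypothetical helper: assumes `config` follows the Hugging Face
    convention of an "architectures" key holding a list of class names.
    """
    for arch in config.get("architectures") or []:
        if arch in SUPPORTED_ARCHITECTURES:
            return SUPPORTED_ARCHITECTURES[arch]
    return None


# Hypothetical config contents, for illustration only.
example_config = json.loads('{"architectures": ["BagelForConditionalGeneration"]}')
print(find_supported_model(example_config))
```

A plain dictionary lookup keeps the check independent of any framework import, so it can run before a model is loaded.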