BAGEL | NVIDIA NeMo AutoModel

BAGEL-7B-MoT is a unified multimodal model from ByteDance Seed. It combines a Qwen2 language backbone, a SigLIP-NaViT vision encoder, and mixture-of-transformations layers for mixed understanding and visual-generation training.


Task	Multimodal Input/Output
Architecture	`BagelForUnifiedMultimodal`, `BagelForConditionalGeneration`
Parameters	14B (two 7B towers)
HF Org	ByteDance-Seed

Available Models

BAGEL-7B-MoT

Architecture

BagelForUnifiedMultimodal
BagelForConditionalGeneration

Example HF Models

Model	HF ID
BAGEL-7B-MoT	`ByteDance-Seed/BAGEL-7B-MoT`

Example Recipes

Recipe	Dataset	Description
bagel_pretrain.yaml	BAGEL-style packed multimodal data	Joint text-understanding and image-generation pretraining
bagel_sft.yaml	BAGEL-style packed multimodal data	Joint understanding + generation fine-tuning

Try with NeMo AutoModel

1. Install (full instructions):

$ pip install nemo-automodel

2. Clone the repo to get the example recipes:

$ git clone https://github.com/NVIDIA-NeMo/Automodel.git
$ cd Automodel

3. Run the recipe from inside the repo:

$ automodel --nproc-per-node=8 examples/multimodal_pretrain/bagel/bagel_pretrain.yaml

Hugging Face Model Cards

ByteDance-Seed/BAGEL-7B-MoT