BAGEL#

BAGEL-7B-MoT is a unified multimodal model from ByteDance Seed. It combines a Qwen2 language backbone, a SigLIP-NaViT vision encoder, and Mixture-of-Transformer-Experts (MoT) layers, and it is trained jointly on multimodal understanding and visual generation.


Task	Multimodal Input/Output
Architecture	`BagelForUnifiedMultimodal`, `BagelForConditionalGeneration`
Parameters	14B (two 7B towers)
HF Org	ByteDance-Seed
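
As a rough guide to what the 14B figure implies for hardware, the arithmetic below converts the parameter count into memory footprints. The byte sizes are assumptions: 4 bytes per fp32 weight, 2 bytes per bf16 weight, and the common rule of thumb of about 16 bytes per parameter for mixed-precision Adam training. Activations and anything not counted in the 14B are excluded, so treat the output as a lower bound, not a measurement.

```python
# Back-of-the-envelope memory arithmetic for a 14B-parameter model.
# All byte sizes are assumptions for illustration, not measured values.
TOTAL_PARAMS = 14e9  # from the table: 14B (two 7B towers)

BYTES_PER_PARAM = {
    "fp32 weights": 4,
    "bf16 weights": 2,
    # Rule of thumb for mixed-precision Adam:
    # 2 (bf16 weights) + 2 (bf16 grads) + 12 (fp32 master weights, momentum, variance)
    "mixed-precision Adam training state": 16,
}


def footprint_gb(num_params: float, bytes_per_param: int) -> float:
    """Return the footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for label, nbytes in BYTES_PER_PARAM.items():
    print(f"{label}: {footprint_gb(TOTAL_PARAMS, nbytes):.0f} GB")
```

Under these assumptions the training state alone exceeds the memory of a single accelerator, so the weights, gradients, and optimizer state would need to be sharded across several devices.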

Available Models#

BAGEL-7B-MoT

Architecture#

BagelForUnifiedMultimodal
BagelForConditionalGeneration

Example HF Models#

Model	HF ID
BAGEL-7B-MoT	`ByteDance-Seed/BAGEL-7B-MoT`
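
If you want the checkpoint on local disk, one option is the `snapshot_download` function from the `huggingface_hub` package. This is a minimal sketch: the helper name `download_bagel` and the default target directory are hypothetical, and this page does not say whether the recipes expect a local path or the Hub ID.

```python
REPO_ID = "ByteDance-Seed/BAGEL-7B-MoT"  # HF ID from the table above


def download_bagel(local_dir: str = "./BAGEL-7B-MoT") -> str:
    """Download all files of the model repo and return the local directory path.

    `local_dir` is a hypothetical default chosen for this example.
    """
    # Imported lazily so this module can be loaded without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_bagel()` fetches the full repository, so make sure the target disk has enough free space for a 14B-parameter checkpoint.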

Example Recipes#

Recipe	Dataset	Description
`bagel_pretrain.yaml`	BAGEL-style packed multimodal data	Joint text-understanding and image-generation pretraining
`bagel_sft.yaml`	BAGEL-style packed multimodal data	Joint understanding and generation fine-tuning

Try with NeMo AutoModel#

1. Install NeMo AutoModel (see the installation guide for full instructions):

pip install nemo-automodel

2. Clone the repo to get the example recipes:

git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel

3. Run the recipe from inside the repo:

automodel --nproc-per-node=8 examples/multimodal_pretrain/bagel/bagel_pretrain.yaml
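
The command in step 3 can also be assembled programmatically, for example when scripting several runs. The sketch below only rebuilds the command line shown above. The helper `build_launch_command` is hypothetical, and whether the recipe runs correctly with a process count other than 8 is an assumption to check against your hardware.

```python
import shlex

# Recipe path taken from step 3; it is relative to the Automodel checkout.
RECIPE = "examples/multimodal_pretrain/bagel/bagel_pretrain.yaml"


def build_launch_command(nproc_per_node: int = 8, recipe: str = RECIPE) -> list:
    """Return the `automodel` launch command as an argument list."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return ["automodel", f"--nproc-per-node={nproc_per_node}", recipe]


cmd = build_launch_command()
print(shlex.join(cmd))
# To launch from inside the Automodel checkout:
#   import subprocess; subprocess.run(cmd, check=True)
```

Keeping the command as a list avoids shell-quoting problems when it is passed to `subprocess.run`.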

Hugging Face Model Cards#

ByteDance-Seed/BAGEL-7B-MoT