BAGEL

BAGEL-7B-MoT is a unified multimodal model from ByteDance Seed. It combines a Qwen2 language backbone, a SigLIP-NaViT vision encoder, and Mixture-of-Transformer-Experts (MoT) layers for joint training on multimodal understanding and visual generation.

Task: Multimodal Input/Output
Architecture: BagelForUnifiedMultimodal, BagelForConditionalGeneration
Parameters: 14B (two 7B towers)
HF Org: ByteDance-Seed

Available Models

  • BAGEL-7B-MoT

Architecture

  • BagelForUnifiedMultimodal
  • BagelForConditionalGeneration

Example HF Models

Model | HF ID
BAGEL-7B-MoT | ByteDance-Seed/BAGEL-7B-MoT
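
The HF ID above can be used to fetch the checkpoint ahead of time. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper name `download_bagel` and the `checkpoints/BAGEL-7B-MoT` target directory are hypothetical, and this page does not state whether the recipes need a pre-downloaded copy.

```python
from pathlib import Path

# HF ID taken from the table above.
MODEL_ID = "ByteDance-Seed/BAGEL-7B-MoT"


def download_bagel(local_dir: str = "checkpoints/BAGEL-7B-MoT") -> Path:
    """Download the full model repository snapshot and return its local path.

    The default target directory is a hypothetical choice, not a path
    required by NeMo AutoModel.
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=MODEL_ID, local_dir=local_dir)
    return Path(path)
```

Calling `download_bagel()` starts a multi-gigabyte download, so it is best run once on a machine with enough disk space.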

Example Recipes

Recipe | Dataset | Description
bagel_pretrain.yaml | BAGEL-style packed multimodal data | Joint text-understanding and image-generation pretraining
bagel_sft.yaml | BAGEL-style packed multimodal data | Joint understanding + generation fine-tuning

Try with NeMo AutoModel

1. Install (full instructions):

$ pip install nemo-automodel

2. Clone the repo to get the example recipes:

$ git clone https://github.com/NVIDIA-NeMo/Automodel.git
$ cd Automodel

3. Run the recipe from inside the repo:

$ automodel --nproc-per-node=8 examples/multimodal_pretrain/bagel/bagel_pretrain.yaml
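
When the launch is scripted, for example from a job submission wrapper, the same command can be assembled in Python. This is a minimal sketch that uses only the flag and recipe path shown above. The helper `build_launch_command` is hypothetical, and whether the recipe runs unchanged with a different process count is an assumption to verify against the recipe itself.

```python
import shlex
from pathlib import Path

# Recipe path copied from the command above, relative to the Automodel repo root.
RECIPE = Path("examples/multimodal_pretrain/bagel/bagel_pretrain.yaml")


def build_launch_command(recipe: Path = RECIPE, nproc_per_node: int = 8) -> list[str]:
    """Return the argv list for the `automodel` launch shown above."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return ["automodel", f"--nproc-per-node={nproc_per_node}", recipe.as_posix()]


# Print the command instead of running it, so it can be inspected first.
cmd = build_launch_command()
print(shlex.join(cmd))
```

To actually start training, pass the list to `subprocess.run(cmd, check=True)` from inside the cloned `Automodel` directory.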

Hugging Face Model Cards