BAGEL#
BAGEL-7B-MoT is a unified multimodal model from ByteDance Seed. It combines a Qwen2 language backbone, a SigLIP-NaViT vision encoder, and Mixture-of-Transformer-Experts (MoT) layers, so that a single model can be trained jointly on multimodal understanding and visual generation.
| | |
|---|---|
| Task | Multimodal Input/Output |
| Architecture | |
| Parameters | 14B (two 7B towers) |
| HF Org | |
Available Models#
BAGEL-7B-MoT
Architecture#
BagelForUnifiedMultimodal
BagelForConditionalGeneration
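To make the MoT idea from the overview more concrete, here is a minimal, dependency-free sketch of how such a layer might route tokens: every token takes part in one shared attention step over the full sequence, and then each token is processed by the expert that matches its modality. Everything in it is an illustrative assumption rather than the BAGEL or NeMo AutoModel implementation: the names `Token`, `EXPERT_FOR_KIND`, `shared_attention`, and `mot_layer` are hypothetical, the "attention" is a plain causal running average, and the experts are simple scaling functions standing in for full transformer blocks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Token:
    kind: str       # "text" or "vit" (understanding side), "vae" (generation side)
    hidden: Vector  # toy hidden state


# Hypothetical routing table: which expert handles which token kind.
EXPERT_FOR_KIND: Dict[str, str] = {"text": "und", "vit": "und", "vae": "gen"}


def scale_by(factor: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for an expert's feed-forward block."""
    return lambda h: [factor * x for x in h]


def shared_attention(seq: List[Vector]) -> List[Vector]:
    """Toy causal mixing: each position averages itself and all earlier positions."""
    if not seq:
        return []
    out: List[Vector] = []
    running = [0.0] * len(seq[0])
    for i, h in enumerate(seq, start=1):
        running = [r + x for r, x in zip(running, h)]
        out.append([r / i for r in running])
    return out


def mot_layer(tokens: List[Token],
              experts: Dict[str, Callable[[Vector], Vector]]) -> List[Token]:
    """Mix all tokens together, then apply the expert selected by token kind."""
    mixed = shared_attention([t.hidden for t in tokens])
    return [
        Token(t.kind, experts[EXPERT_FOR_KIND[t.kind]](h))
        for t, h in zip(tokens, mixed)
    ]


if __name__ == "__main__":
    experts = {"und": scale_by(1.0), "gen": scale_by(10.0)}
    sequence = [Token("text", [2.0, 4.0]), Token("vae", [4.0, 0.0])]
    for tok in mot_layer(sequence, experts):
        print(tok.kind, tok.hidden)
```

The point of the sketch is the hard, modality-based routing: unlike a learned mixture-of-experts gate, the choice of expert depends only on the token type, while the shared mixing step lets understanding and generation tokens see each other's context.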
Example HF Models#
| Model | HF ID |
|---|---|
| BAGEL-7B-MoT | |
Example Recipes#
| Recipe | Dataset | Description |
|---|---|---|
| | BAGEL-style packed multimodal data | Joint text-understanding and image-generation pretraining |
| | BAGEL-style packed multimodal data | Joint understanding + generation fine-tuning |
Try with NeMo AutoModel#
1. Install (full instructions):
pip install nemo-automodel
2. Clone the repo to get the example recipes:
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
3. Run the recipe from inside the repo:
automodel --nproc-per-node=8 examples/multimodal_pretrain/bagel/bagel_pretrain.yaml
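If you launch the recipe from a script rather than by hand, the command above can be assembled programmatically. The following is a small sketch using only the Python standard library; `build_launch_command` and `recipe_is_missing` are hypothetical helper names, and the only CLI surface it relies on is the `automodel --nproc-per-node=<N> <recipe>` form shown in step 3. Passing a value other than 8 assumes the flag accepts any positive process count.

```python
import shlex
from pathlib import Path
from typing import List

# Recipe path taken from step 3 above, relative to the Automodel repo root.
BAGEL_PRETRAIN_RECIPE = "examples/multimodal_pretrain/bagel/bagel_pretrain.yaml"


def build_launch_command(recipe: str, nproc_per_node: int = 8) -> List[str]:
    """Return the argv list for the launch command shown in step 3."""
    if nproc_per_node < 1:
        raise ValueError("nproc_per_node must be at least 1")
    return ["automodel", f"--nproc-per-node={nproc_per_node}", recipe]


def recipe_is_missing(recipe: str, repo_root: str = ".") -> bool:
    """True when the recipe file is not found under repo_root (e.g. wrong working directory)."""
    return not (Path(repo_root) / recipe).is_file()


if __name__ == "__main__":
    if recipe_is_missing(BAGEL_PRETRAIN_RECIPE):
        print("Recipe not found; run this from inside the cloned Automodel repo.")
    # Print the command instead of executing it, so the sketch has no side effects.
    print(shlex.join(build_launch_command(BAGEL_PRETRAIN_RECIPE)))
```

To actually start training, the returned list could be handed to `subprocess.run(cmd, check=True)` from the repository root.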