`nemo_automodel.components.models.bagel.model`#

Top-level BAGEL model.

The module supports Stage 1 understanding-only CE and Stage 2 joint understanding and visual generation with flow-matching MSE. The VAE itself stays outside this module tree; the recipe passes VAE-encoded latents to forward.

Top-level module is nn.Module (not PreTrainedModel) + mixed-in HFCheckpointingMixin — matches the llava-onevision pattern and avoids the FSDP double-root issue that bites us with PreTrainedModel-derived roots.

Module Contents#

Classes#

`BagelModel`	Plain container for the three BAGEL submodules.
`BagelForUnifiedMultimodal`	BAGEL mixed-modal LLM wrapper for understanding and optional generation.

Functions#

`_stage_to_int`	Normalize a BAGEL training stage value to `1` or `2`.
`_prepare_config_for_stage`	Apply BAGEL stage/checkpoint config fixes before module construction.
`_convert_patch_embedding_for_packed_vit`	Swap SigLIP patch embedding to Linear for BAGEL packed pixel inputs.

Data#

logger

API#

nemo_automodel.components.models.bagel.model.logger#: ‘getLogger(…)’

nemo_automodel.components.models.bagel.model._stage_to_int(stage: Union[int, str]) → int#: Normalize a BAGEL training stage value to 1 or 2.

nemo_automodel.components.models.bagel.model._prepare_config_for_stage( config: nemo_automodel.components.models.bagel.configuration.BagelConfig, ) → None#

Apply BAGEL stage/checkpoint config fixes before module construction.

AutoModel instantiates custom models as model_cls(config) and lets the common checkpointer load weights later. BAGEL therefore needs the same stage-dependent config mutations that its direct from_pretrained path used to do before BagelModel is built.

nemo_automodel.components.models.bagel.model._convert_patch_embedding_for_packed_vit( model: BagelModel, config: nemo_automodel.components.models.bagel.configuration.BagelConfig, ) → None#: Swap SigLIP patch embedding to Linear for BAGEL packed pixel inputs.

class nemo_automodel.components.models.bagel.model.BagelModel( config: nemo_automodel.components.models.bagel.configuration.BagelConfig, )#

Bases: torch.nn.Module

Plain container for the three BAGEL submodules.

Attribute names (language_model, vit_model, connector, vit_pos_embed) match the checkpoint layout so the state-dict adapter maps identity. There’s no forward logic here - this class exists so that FSDP / state-dict tooling sees the expected tree structure without being confused by the HFCheckpointingMixin root.

When config.visual_gen=True (Stage 2), we additionally attach the generation-side siblings (time_embedder, vae2llm, llm2vae, latent_pos_embed) so the flow-matching head is ready to run. The VAE model itself is NOT owned here; the recipe keeps it separate (frozen, inference-only) and passes already-encoded latents into BagelForUnifiedMultimodal.forward.

Initialization

class nemo_automodel.components.models.bagel.model.BagelForUnifiedMultimodal( config: nemo_automodel.components.models.bagel.configuration.BagelConfig, )#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module

BAGEL mixed-modal LLM wrapper for understanding and optional generation.

visual_gen=False gives the Stage 1 understanding-only path. Stage 2 sets visual_gen=True and uses the MoT *_moe_gen parameter siblings, VAE latents prepared by the recipe, and the flow-matching MSE head.

Initialization

config_class#: None

class ModelCapabilities#

Declared parallelism capabilities for this model class.

supports_tp: bool#: False

supports_cp: bool#: False

supports_pp: bool#: False

supports_ep: bool#: False

initialize_weights() → None#

Initialize BAGEL weights after AM materializes a from_config model.

BagelForUnifiedMultimodal is an nn.Module root, not a HF PreTrainedModel root. AM’s meta-device from_config path materializes parameters after sharding and then calls this method. Delegate Qwen/SigLIP subtrees to their HF-style initializers, then initialize BAGEL-only connector and generation modules.

classmethod from_pretrained(

pretrained_model_name_or_path: Union[str, os.PathLike],

*,

stage: Union[int, str] = 1,

strict: bool = False,

**kwargs: Any,

) → nemo_automodel.components.models.bagel.model.BagelForUnifiedMultimodal#

Load a BAGEL-7B-MoT checkpoint directory into this class.

Reads config.json via :meth:BagelConfig.from_pretrained, constructs an empty model, and then loads ema.safetensors filtered by

Class:

BagelStateDictAdapter. Stage 2 VAE weights are loaded by the recipe because the VAE is not owned by this module tree.

Parameters:

pretrained_model_name_or_path – Directory containing the HF-layout BAGEL checkpoint.
stage – 1 (UND only) or 2 (UND + GEN). Strings "stage1" / "stage2" are also accepted.
strict – If True, raise on state-dict keys that don’t match the adapter patterns. Defaults to False for compatibility with checkpoint sidecar files.
**kwargs – Forwarded to BagelConfig.from_pretrained.

Returns:

A fully-initialized BagelForUnifiedMultimodal with weights populated from disk. For Stage 1, visual_gen is forced off on the loaded config so the MoT gen-side path is left untouched.

classmethod supports_config(config: Any) → bool#: Return True if this custom class supports config.

get_input_embeddings() → torch.nn.Module#

get_output_embeddings() → torch.nn.Module#

forward( sequence_length: int, packed_text_ids: torch.LongTensor, packed_text_indexes: torch.LongTensor, sample_lens: List[int], packed_position_ids: torch.LongTensor, nested_attention_masks: Optional[List[torch.Tensor]] = None, split_lens: Optional[List[int]] = None, attn_modes: Optional[List[str]] = None, packed_vit_tokens: Optional[torch.Tensor] = None, packed_vit_token_indexes: Optional[torch.LongTensor] = None, packed_vit_position_ids: Optional[torch.LongTensor] = None, vit_token_seqlens: Optional[torch.Tensor] = None, padded_latent: Optional[torch.Tensor] = None, patchified_vae_latent_shapes: Optional[List[Tuple[int, int]]] = None, packed_latent_position_ids: Optional[torch.LongTensor] = None, packed_vae_token_indexes: Optional[torch.LongTensor] = None, packed_timesteps: Optional[torch.Tensor] = None, mse_loss_indexes: Optional[torch.Tensor] = None, ce_loss_indexes: Optional[torch.Tensor] = None, packed_label_ids: Optional[torch.Tensor] = None, ce_loss_weights: Optional[torch.Tensor] = None, ) → Dict[str, Optional[torch.Tensor]]#

Run the BAGEL mixed-modal forward.

Stage 1 (visual_gen=False) skips the flow-matching branch and MSE computation; Stage 2 activates both. ce_loss_weights is accepted for data-pipeline compatibility but not consumed here - CE is returned per-token (reduction="none") and the trainer may apply weights downstream.

Stage 2 inputs (padded_latent, patchified_vae_latent_shapes, packed_latent_position_ids, packed_vae_token_indexes, packed_timesteps, mse_loss_indexes) are produced by the BAGEL collator when a pack contains t2i/edit samples. padded_latent is the VAE-encoded latent tensor; recipe must call vae_model.encode on the raw padded_images before forward; this module does not own the VAE.

Returns:: dict(ce=Tensor|None, mse=Tensor|None) - both can be None when the pack has no samples of that loss type. Shape-stable with the BAGEL packed-training path.

nemo_automodel.components.models.bagel.model#