nemo_automodel.components.models.bagel.modeling_siglip_navit
SigLIP + NaViT vision tower for BAGEL.
NaViT-flavored differences from stock HF SigLIP:
- Variable-resolution packing: forward takes packed_pixel_values (already patchified to shape (total_patches, C*P*P) after the conv->linear conversion) together with cu_seqlens / max_seqlen so that multiple images with different grids can share one forward call.
- 2D rotary position embedding (RotaryEmbedding2D) applied on the first/second halves of the head dim, replacing the learnt absolute positional embedding table when config.rope=True.
- Packed flash-attention (flash_attn_varlen_func) in place of the dense SiglipAttention.forward.
- Conv2d -> Linear patch embedding swap via SiglipVisionEmbeddings.convert_conv2d_to_linear. Upstream calls this after loading the separately materialized ViT; AM calls it before loading the released BAGEL checkpoint because that checkpoint already stores the linear layout.
Class names and parameter attribute names preserve the BAGEL checkpoint layout
so that ema.safetensors keys prefixed with vit_model.vision_model. load
via the state-dict adapter without key surgery.
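
The packing metadata described above can be illustrated with a small sketch. It assumes cu_seqlens holds cumulative patch counts with a leading zero and max_seqlen is the longest per-image patch count, which is the convention flash_attn_varlen_func expects; the helper name build_packing_metadata is hypothetical and not part of this module.

```python
from itertools import accumulate


def build_packing_metadata(grids):
    """Derive packing metadata from per-image patch grids.

    grids: list of (patches_high, patches_wide), one entry per image.
    Hypothetical helper for illustration only.
    """
    seqlens = [h * w for h, w in grids]           # patches per image
    cu_seqlens = [0] + list(accumulate(seqlens))  # segment boundaries in the packed sequence
    max_seqlen = max(seqlens)                     # longest single image
    return cu_seqlens, max_seqlen


# Three images with different grids share one packed sequence of 27 patches.
grids = [(2, 3), (4, 4), (1, 5)]
cu_seqlens, max_seqlen = build_packing_metadata(grids)
print(cu_seqlens)  # [0, 6, 22, 27]
print(max_seqlen)  # 16
```

Image i then occupies rows cu_seqlens[i]:cu_seqlens[i + 1] of packed_pixel_values.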
Module Contents
Classes
Functions
Data
API
Bases: Module
2D RoPE with separate height/width frequency tables.
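
As a rough illustration of the idea (not the class's actual code), the sketch below rotates the first half of a head vector by angles derived from the patch row and the second half by angles derived from the patch column. The function names, the base theta=10000.0, and the rotate-half pairing convention are all assumptions made for this example.

```python
import math


def axis_angles(pos, dim, theta=10000.0):
    """Rotation angles for one axis; `dim` channels use dim // 2 frequencies."""
    return [pos * theta ** (-2.0 * i / dim) for i in range(dim // 2)]


def rotate_block(x, angles):
    """Rotate channel pairs (x[i], x[i + half]) by angles[i]."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    out1 = [a * math.cos(t) - b * math.sin(t) for a, b, t in zip(x1, x2, angles)]
    out2 = [a * math.sin(t) + b * math.cos(t) for a, b, t in zip(x1, x2, angles)]
    return out1 + out2


def apply_rope_2d(vec, row, col, theta=10000.0):
    """First half of the head dim encodes the row, second half the column."""
    assert len(vec) % 4 == 0, "head dim must be divisible by 4"
    half = len(vec) // 2
    return (rotate_block(vec[:half], axis_angles(row, half, theta))
            + rotate_block(vec[half:], axis_angles(col, half, theta)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k = [0.5, -1.0, 2.0, 0.0, 1.5, 1.0, -2.0, 3.0]

# Both pairs of positions differ by the same (row, col) offset of (1, 2),
# so the attention scores agree up to floating-point error.
score_a = dot(apply_rope_2d(q, 2, 3), apply_rope_2d(k, 1, 1))
score_b = dot(apply_rope_2d(q, 5, 7), apply_rope_2d(k, 4, 5))
```

Because each half is a pure rotation, the query-key score depends only on the relative row and column offsets, which is what lets the rotary scheme stand in for an absolute positional table across different grid sizes.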
Bases: Module
SigLIP attention projection container for the NaViT subclass.
Keeping the projections directly on this module preserves the expected
parameter names (q_proj, k_proj, v_proj, out_proj) for
checkpoint loading. Forward is intentionally omitted because the packed
NaViT variant is the only runtime path in this tree.
Bases: Module
Stack of SigLIP NaViT encoder layers.
Bases: Module
SigLIP NaViT encoder layer with packed flash attention.
Bases: SiglipAttention
Packed-sequence flash-attention variant with optional 2D RoPE.
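
To make the packed semantics concrete, here is a minimal single-head reference in plain Python for what a variable-length attention kernel computes: non-causal softmax attention restricted to each segment delimited by cu_seqlens. It is a sketch for intuition only; the function names are hypothetical, and the real path uses flash_attn_varlen_func on GPU tensors.

```python
import math


def attend(q, k, v):
    """Dense softmax attention within one segment; q, k, v are lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def packed_attention(q, k, v, cu_seqlens):
    """Attend independently inside each [start, end) segment of the packed sequence."""
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        out.extend(attend(q[start:end], k[start:end], v[start:end]))
    return out


# Two images packed together: 2 tokens, then 3 tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.5], [0.5, 2.0]]
cu_seqlens = [0, 2, 5]
out = packed_attention(tokens, tokens, tokens, cu_seqlens)

# Changing the second image leaves the first image's outputs untouched.
changed = tokens[:2] + [[9.0, 9.0], [-3.0, 4.0], [0.0, 0.0]]
out_changed = packed_attention(changed, changed, changed, cu_seqlens)
```

Tokens from different images never attend to each other, so no padding or explicit attention mask is needed.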
Bases: Module
SigLIP vision MLP block used inside the BAGEL NaViT encoder.
Bases: PreTrainedModel
Abstract weight-init base for SigLIP vision modules.
Bases: _HFSiglipVisionConfig
SigLIP vision config with the NaViT rope flag added.
Mirrors upstream modeling/bagel/siglip_navit.py::SiglipVisionConfig.
Bases: Module
NaViT patch embedder.
At construction time patch_embedding is an nn.Conv2d. The BAGEL
model wrapper calls convert_conv2d_to_linear to swap it for an
equivalent nn.Linear so the forward path can consume pre-patchified
packed_pixel_values of shape (total_patches, C*P*P). For the
released BAGEL-7B-MoT checkpoint, this conversion must happen before load
because the checkpoint already stores the linear tensor shape.
Swap the Conv2d patch embedding in place for the mathematically equivalent Linear.
Called once, before checkpoint load, by the BAGEL model wrapper. After this
runs, patch_embedding expects a 2-D (total_patches, C*P*P)
input instead of 4-D (N, C, H, W).
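
The equivalence can be checked with a small pure-Python sketch: a convolution whose kernel size equals its stride P sees each patch exactly once, so it matches a linear layer applied to the flattened patch. All function names here are hypothetical, and the (c, dy, dx) flattening order is an assumption; whatever order the real patchify step uses, the weight must be flattened the same way.

```python
def conv_patch_embed(image, weight, bias, P):
    """Conv2d with kernel_size == stride == P.

    image[c][y][x], weight[o][c][dy][dx]; returns (num_patches, D) in row-major patch order.
    """
    C, H, W = len(image), len(image[0]), len(image[0][0])
    out = []
    for gy in range(H // P):
        for gx in range(W // P):
            out.append([
                bias[o] + sum(weight[o][c][dy][dx] * image[c][gy * P + dy][gx * P + dx]
                              for c in range(C) for dy in range(P) for dx in range(P))
                for o in range(len(weight))
            ])
    return out


def patchify(image, P):
    """(C, H, W) -> (num_patches, C*P*P), flattened in (c, dy, dx) order (assumed)."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    return [[image[c][gy * P + dy][gx * P + dx]
             for c in range(C) for dy in range(P) for dx in range(P)]
            for gy in range(H // P) for gx in range(W // P)]


def conv_weight_to_linear(weight):
    """Flatten (D, C, P, P) -> (D, C*P*P) in the same order patchify uses."""
    return [[w for channel in kernel for row in channel for w in row] for kernel in weight]


def linear_patch_embed(patches, flat_weight, bias):
    """Plain linear layer over pre-patchified rows."""
    return [[b + sum(w * x for w, x in zip(row, patch)) for row, b in zip(flat_weight, bias)]
            for patch in patches]


# Toy sizes: C=2 channels, 4x4 image, P=2, D=3 output features; integers keep the check exact.
C, H, W, P, D = 2, 4, 4, 2, 3
image = [[[c * 16 + y * 4 + x for x in range(W)] for y in range(H)] for c in range(C)]
weight = [[[[(o + 1) * (c + 1) + dy - dx for dx in range(P)] for dy in range(P)]
           for c in range(C)] for o in range(D)]
bias = [1, -2, 3]

patches = patchify(image, P)
conv_out = conv_patch_embed(image, weight, bias, P)
linear_out = linear_patch_embed(patches, conv_weight_to_linear(weight), bias)
assert conv_out == linear_out
```

Only the weight's shape changes in the swap, from (D, C, P, P) to (D, C*P*P), which is why a checkpoint saved after conversion cannot be loaded into the unconverted Conv2d module.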
Bases: SiglipPreTrainedModel
Top-level vision model. Stored at bagel_model.vit_model per BAGEL’s checkpoint layout.
Bases: Module
BAGEL SigLIP vision transformer over packed patch embeddings.
Module-level helper mirroring BAGEL’s pretrain_unified_navit.py:525-526.
Equivalent to vit_model.vision_model.embeddings.convert_conv2d_to_linear(config, meta).