nemo_automodel.components.models.bagel.modeling_siglip_navit#
SigLIP + NaViT vision tower for BAGEL.
NaViT-flavored differences from stock HF SigLIP:
- Variable-resolution packing: forward takes packed_pixel_values (already patchified to shape (total_patches, C*P*P) after the conv->linear conversion) together with cu_seqlens / max_seqlen so that multiple images with different grids can share one forward call.
- 2D rotary position embedding (RotaryEmbedding2D) applied on the first/second halves of the head dim, replacing the learnt absolute positional embedding table when config.rope=True.
- Packed flash-attention (flash_attn_varlen_func) in place of the dense SiglipAttention.forward.
- Conv2d -> Linear patch embedding swap via SiglipVisionEmbeddings.convert_conv2d_to_linear. Upstream calls this after loading the separately materialized ViT; AM calls it before loading the released BAGEL checkpoint because that checkpoint already stores the linear layout.
Class names and parameter attribute names preserve the BAGEL checkpoint layout
so that ema.safetensors keys prefixed with vit_model.vision_model. load
via the state-dict adapter without key surgery.
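The packing arguments described above can be illustrated with a minimal sketch. It assumes cu_seqlens follows the usual flash-attention convention (cumulative patch counts with a leading zero) and uses plain Python lists in place of the int32 tensors the model would receive. build_cu_seqlens and split_packed are hypothetical helper names, not part of this module.

```python
from itertools import accumulate


def build_cu_seqlens(grids):
    """Map per-image patch grids [(h, w), ...] to (cu_seqlens, max_seqlen)."""
    seqlens = [h * w for h, w in grids]     # patches contributed by each image
    cu_seqlens = [0, *accumulate(seqlens)]  # boundaries, length = num_images + 1
    return cu_seqlens, max(seqlens)


def split_packed(rows, cu_seqlens):
    """Slice a packed (total_patches, ...) sequence back into per-image chunks."""
    return [rows[start:end] for start, end in zip(cu_seqlens, cu_seqlens[1:])]


grids = [(2, 3), (4, 4), (1, 5)]            # three images with different grids
cu_seqlens, max_seqlen = build_cu_seqlens(grids)
packed = list(range(cu_seqlens[-1]))        # stand-in for the packed patch rows
chunks = split_packed(packed, cu_seqlens)
print(cu_seqlens, max_seqlen, [len(c) for c in chunks])
```

The same boundaries can be reused after the forward pass to recover per-image outputs from the packed result.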
Module Contents#
Classes#
SiglipVisionConfig | SigLIP vision config with the NaViT rope flag added.
RotaryEmbedding2D | 2D RoPE with separate height/width frequency tables.
SiglipVisionEmbeddings | NaViT patch embedder.
SiglipAttention | SigLIP attention projection container for the NaViT subclass.
SiglipFlashAttention2 | Packed-sequence flash-attention variant with optional 2D RoPE.
SiglipMLP | SigLIP vision MLP block used inside the BAGEL NaViT encoder.
SiglipEncoderLayer | SigLIP NaViT encoder layer with packed flash attention.
SiglipEncoder | Stack of SigLIP NaViT encoder layers.
SiglipVisionTransformer | BAGEL SigLIP vision transformer over packed patch embeddings.
SiglipPreTrainedModel | Abstract weight-init base for SigLIP vision modules.
SiglipVisionModel | Top-level vision model. Stored at bagel_model.vit_model per BAGEL’s checkpoint layout.
Functions#
convert_conv2d_to_linear | Module-level helper mirroring BAGEL’s pretrain_unified_navit.py:525-526.
Data#
API#
- nemo_automodel.components.models.bagel.modeling_siglip_navit.__all__#
[‘SiglipVisionConfig’, ‘RotaryEmbedding2D’, ‘SiglipVisionEmbeddings’, ‘SiglipAttention’, ‘SiglipFlas…
- nemo_automodel.components.models.bagel.modeling_siglip_navit._flash_attn_varlen(*args, **kwargs)#
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig(
- hidden_size: int = 768,
- intermediate_size: int = 3072,
- num_hidden_layers: int = 12,
- num_attention_heads: int = 12,
- num_channels: int = 3,
- image_size: int = 224,
- patch_size: int = 16,
- hidden_act: str = 'gelu_pytorch_tanh',
- layer_norm_eps: float = 1e-06,
- attention_dropout: float = 0.0,
- rope: bool = True,
- **kwargs,
Bases: transformers.SiglipVisionConfig
SigLIP vision config with the NaViT rope flag added.
Mirrors upstream modeling/bagel/siglip_navit.py::SiglipVisionConfig.
Initialization
- model_type#
‘siglip_vision_model’
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.RotaryEmbedding2D(dim: int, max_h: int, max_w: int, base: int = 10000)#
Bases: torch.nn.Module
2D RoPE with separate height/width frequency tables.
Initialization
- static _forward_one_side(grid: torch.Tensor, inv_freq: torch.Tensor)#
- nemo_automodel.components.models.bagel.modeling_siglip_navit.rotate_half(x: torch.Tensor) → torch.Tensor#
- nemo_automodel.components.models.bagel.modeling_siglip_navit.apply_rotary_pos_emb(
- q: torch.Tensor,
- k: torch.Tensor,
- cos: torch.Tensor,
- sin: torch.Tensor,
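rotate_half and apply_rotary_pos_emb are listed here without descriptions. The sketch below assumes they follow the widely used Hugging Face convention (split the last dimension into two halves, return (-x2, x1), then combine as x * cos + rotate_half(x) * sin). It also assumes the height table drives the first half of the head dimension and the width table the second; the module overview only states that the two halves are treated separately. Plain Python lists stand in for tensors, and rope_tables and apply_rotary_2d are hypothetical names.

```python
import math


def rotate_half(x):
    """Split x into halves (x1, x2) and return (-x2, x1)."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1


def apply_rotary(x, cos, sin):
    """Element-wise x * cos + rotate_half(x) * sin."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


def rope_tables(position, dim, base=10000):
    """cos/sin tables for one position; angles are duplicated across both halves."""
    inv_freq = [base ** (-(2 * i) / dim) for i in range(dim // 2)]
    angles = [position * f for f in inv_freq] * 2
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def apply_rotary_2d(x, h_pos, w_pos):
    """Assumed layout: first half rotated by the row index, second half by the column index."""
    half = len(x) // 2
    first = apply_rotary(x[:half], *rope_tables(h_pos, half))
    second = apply_rotary(x[half:], *rope_tables(w_pos, half))
    return first + second


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # one head vector, head_dim = 8
q_rot = apply_rotary_2d(q, h_pos=3, w_pos=5)
```

Because each pair of coordinates is rotated by a plain 2D rotation, the vector norm is unchanged and position (0, 0) leaves the vector as it was.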
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionEmbeddings( )#
Bases: torch.nn.Module
NaViT patch embedder.
At construction time patch_embedding is an nn.Conv2d. The BAGEL model wrapper calls convert_conv2d_to_linear to swap it for an equivalent nn.Linear so the forward path can consume pre-patchified packed_pixel_values of shape (total_patches, C*P*P). For the released BAGEL-7B-MoT checkpoint, this conversion must happen before load because the checkpoint already stores the linear tensor shape.
Initialization
- convert_conv2d_to_linear(
- config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig,
- meta: bool = False,
Swaps the Conv2d patch embedding in place for the mathematically equivalent Linear.
Called once, before checkpoint load, by the BAGEL model wrapper. After this runs, patch_embedding expects a 2-D (total_patches, C*P*P) input instead of a 4-D (N, C, H, W) input.
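The equivalence claimed above can be checked on a toy example. The sketch assumes a convolution whose kernel size and stride both equal the patch size P with no padding, and that each patch is flattened in (C, P, P) order, which is the order obtained by flattening a Conv2d weight of shape (O, C, P, P) per output channel. Nested Python lists stand in for tensors, and all function names are hypothetical.

```python
import random


def conv2d_patchify(image, weight, bias, P):
    """image[C][H][W], weight[O][C][P][P]; kernel = stride = P, no padding."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    out = []
    for top in range(0, H, P):
        for left in range(0, W, P):
            vec = []
            for o in range(len(weight)):
                acc = bias[o]
                for c in range(C):
                    for i in range(P):
                        for j in range(P):
                            acc += weight[o][c][i][j] * image[c][top + i][left + j]
                vec.append(acc)
            out.append(vec)
    return out  # (num_patches, O), row-major over the patch grid


def flatten_patches(image, P):
    """Return rows of length C*P*P, one per patch, flattened in (C, P, P) order."""
    C, H, W = len(image), len(image[0]), len(image[0][0])
    return [
        [image[c][top + i][left + j] for c in range(C) for i in range(P) for j in range(P)]
        for top in range(0, H, P)
        for left in range(0, W, P)
    ]


def linear(rows, flat_weight, bias):
    """Plain matrix product rows @ flat_weight.T + bias."""
    return [
        [sum(w * x for w, x in zip(w_row, row)) + b for w_row, b in zip(flat_weight, bias)]
        for row in rows
    ]


rng = random.Random(0)
C, H, W, P, O = 3, 4, 6, 2, 5
image = [[[rng.randint(-3, 3) for _ in range(W)] for _ in range(H)] for _ in range(C)]
weight = [[[[rng.randint(-3, 3) for _ in range(P)] for _ in range(P)] for _ in range(C)] for _ in range(O)]
bias = [rng.randint(-3, 3) for _ in range(O)]

# Flattening each output channel's kernel gives the Linear weight of shape (O, C*P*P).
flat_weight = [[weight[o][c][i][j] for c in range(C) for i in range(P) for j in range(P)] for o in range(O)]

patches = flatten_patches(image, P)
conv_out = conv2d_patchify(image, weight, bias, P)
linear_out = linear(patches, flat_weight, bias)
```

Integer inputs are used so the two paths agree exactly; with floating-point weights the results would match only up to rounding.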
- forward(
- packed_pixel_values: torch.FloatTensor,
- packed_flattened_position_ids: torch.LongTensor,
- nemo_automodel.components.models.bagel.modeling_siglip_navit.convert_conv2d_to_linear(
- vit_model: SiglipVisionModel,
- config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig,
- meta: bool = False,
Module-level helper mirroring BAGEL’s pretrain_unified_navit.py:525-526.
Equivalent to vit_model.vision_model.embeddings.convert_conv2d_to_linear(config, meta).
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipAttention( )#
Bases: torch.nn.Module
SigLIP attention projection container for the NaViT subclass.
Keeping the projections directly on this module preserves the expected parameter names (q_proj, k_proj, v_proj, out_proj) for checkpoint loading. Forward is intentionally omitted because the packed NaViT variant is the only runtime path in this tree.
Initialization
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipFlashAttention2( )#
Bases: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipAttention
Packed-sequence flash-attention variant with optional 2D RoPE.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.IntTensor,
- max_seqlen: int,
- cos_h: Optional[torch.Tensor] = None,
- sin_h: Optional[torch.Tensor] = None,
- cos_w: Optional[torch.Tensor] = None,
- sin_w: Optional[torch.Tensor] = None,
- **kwargs,
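For intuition about what the packed forward computes: variable-length flash attention treats each span [cu_seqlens[i], cu_seqlens[i+1]) as an independent sequence. Assuming non-causal attention, that corresponds to dense attention over the packed tokens under a block-diagonal mask. The sketch below builds that mask in plain Python; segment_ids and block_diagonal_mask are hypothetical names used only for illustration.

```python
from bisect import bisect_right


def segment_ids(cu_seqlens):
    """Index of the image each packed token belongs to."""
    total = cu_seqlens[-1]
    return [bisect_right(cu_seqlens, t) - 1 for t in range(total)]


def block_diagonal_mask(cu_seqlens):
    """mask[i][j] is True when tokens i and j come from the same image."""
    seg = segment_ids(cu_seqlens)
    return [[si == sj for sj in seg] for si in seg]


cu_seqlens = [0, 2, 5]                 # two images with 2 and 3 patches
mask = block_diagonal_mask(cu_seqlens)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The varlen kernel never materializes this mask, which is the reason for using it on packed inputs; the dense form is only a reference for reasoning about which tokens interact.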
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipMLP( )#
Bases: torch.nn.Module
SigLIP vision MLP block used inside the BAGEL NaViT encoder.
Initialization
- forward(hidden_states: torch.Tensor) → torch.Tensor#
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipEncoderLayer( )#
Bases: torch.nn.Module
SigLIP NaViT encoder layer with packed flash attention.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.IntTensor,
- max_seqlen: int,
- cos_h: Optional[torch.Tensor] = None,
- sin_h: Optional[torch.Tensor] = None,
- cos_w: Optional[torch.Tensor] = None,
- sin_w: Optional[torch.Tensor] = None,
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipEncoder( )#
Bases: torch.nn.Module
Stack of SigLIP NaViT encoder layers.
Initialization
- forward(
- inputs_embeds: torch.Tensor,
- cu_seqlens: torch.IntTensor,
- max_seqlen: int,
- cos_h: Optional[torch.Tensor] = None,
- sin_h: Optional[torch.Tensor] = None,
- cos_w: Optional[torch.Tensor] = None,
- sin_w: Optional[torch.Tensor] = None,
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionTransformer( )#
Bases: torch.nn.Module
BAGEL SigLIP vision transformer over packed patch embeddings.
Initialization
- forward(
- packed_pixel_values: torch.Tensor,
- packed_flattened_position_ids: torch.LongTensor,
- cu_seqlens: torch.IntTensor,
- max_seqlen: int,
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipPreTrainedModel#
Bases: transformers.modeling_utils.PreTrainedModel
Abstract weight-init base for SigLIP vision modules.
- config_class#
None
- base_model_prefix#
‘siglip’
- supports_gradient_checkpointing#
True
- _no_split_modules#
[‘SiglipVisionEmbeddings’, ‘SiglipEncoderLayer’]
- _supports_flash_attn_2#
True
- _supports_sdpa#
False
- _init_weights(module: torch.nn.Module) → None#
- class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionModel( )#
Bases: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipPreTrainedModel
Top-level vision model. Stored at bagel_model.vit_model per BAGEL’s checkpoint layout.
Initialization
- config_class#
None
- main_input_name#
‘packed_pixel_values’
- get_input_embeddings() → torch.nn.Module#
- forward(
- packed_pixel_values: torch.Tensor,
- packed_flattened_position_ids: torch.LongTensor,
- cu_seqlens: torch.IntTensor,
- max_seqlen: int,