nemo_automodel.components.models.bagel.modeling_siglip_navit

SigLIP + NaViT vision tower for BAGEL.

NaViT-flavored differences from stock HF SigLIP:

  • Variable-resolution packing: forward takes packed_pixel_values (already patchified to shape (total_patches, C*P*P) after the conv->linear conversion) together with cu_seqlens / max_seqlen so that multiple images with different grids can share one forward call.
  • 2D rotary position embedding (RotaryEmbedding2D) applied on the first/second halves of the head dim, replacing the learnt absolute positional embedding table when config.rope=True.
  • Packed flash-attention (flash_attn_varlen_func) in place of the dense SiglipAttention.forward.
  • Conv2d -> Linear patch embedding swap via SiglipVisionEmbeddings.convert_conv2d_to_linear. Upstream calls this after loading the separately materialized ViT; this implementation calls it before loading the released BAGEL checkpoint because that checkpoint already stores the linear layout.
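
The packing metadata described in the first bullet can be sketched as follows. This is a minimal illustration, not the module's own code: `build_cu_seqlens` is a hypothetical helper, and it assumes each image contributes `h * w` patches and that `cu_seqlens` holds cumulative boundaries with a leading zero, which is the layout flash-attention's variable-length kernels expect.

```python
import torch
import torch.nn.functional as F


def build_cu_seqlens(grids):
    """Hypothetical helper: grids is a list of (h_patches, w_patches) per image."""
    # One sequence per image; its length is the number of patches in its grid.
    seqlens = torch.tensor([h * w for h, w in grids], dtype=torch.int32)
    # Cumulative boundaries with a leading 0: image i owns rows
    # cu_seqlens[i]:cu_seqlens[i + 1] of the packed tensor.
    cu_seqlens = F.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())
    return cu_seqlens, max_seqlen


# Three images with different patch grids sharing one packed batch.
cu_seqlens, max_seqlen = build_cu_seqlens([(2, 3), (4, 4), (1, 5)])
total_patches = int(cu_seqlens[-1])
```

With these grids the packed tensor has 27 rows, and the second image occupies rows 6 through 21.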

Class names and parameter attribute names preserve the BAGEL checkpoint layout, so ema.safetensors keys that carry the vit_model.vision_model. prefix load through the state-dict adapter without any key renaming.

Module Contents

Classes

| Name | Description |
| --- | --- |
| RotaryEmbedding2D | 2D RoPE with separate height/width frequency tables. |
| SiglipAttention | SigLIP attention projection container for the NaViT subclass. |
| SiglipEncoder | Stack of SigLIP NaViT encoder layers. |
| SiglipEncoderLayer | SigLIP NaViT encoder layer with packed flash attention. |
| SiglipFlashAttention2 | Packed-sequence flash-attention variant with optional 2D RoPE. |
| SiglipMLP | SigLIP vision MLP block used inside the BAGEL NaViT encoder. |
| SiglipPreTrainedModel | Abstract weight-init base for SigLIP vision modules. |
| SiglipVisionConfig | SigLIP vision config with the NaViT rope flag added. |
| SiglipVisionEmbeddings | NaViT patch embedder. |
| SiglipVisionModel | Top-level vision model. Stored at bagel_model.vit_model per BAGEL’s checkpoint layout. |
| SiglipVisionTransformer | BAGEL SigLIP vision transformer over packed patch embeddings. |

Functions

| Name | Description |
| --- | --- |
| _flash_attn_varlen | - |
| apply_rotary_pos_emb | - |
| convert_conv2d_to_linear | Module-level helper mirroring BAGEL’s pretrain_unified_navit.py:525-526. |
| rotate_half | - |

Data

__all__

API

class nemo_automodel.components.models.bagel.modeling_siglip_navit.RotaryEmbedding2D(
dim: int,
max_h: int,
max_w: int,
base: int = 10000
)

Bases: Module

2D RoPE with separate height/width frequency tables.

nemo_automodel.components.models.bagel.modeling_siglip_navit.RotaryEmbedding2D._forward_one_side(
grid: torch.Tensor,
inv_freq: torch.Tensor
)
staticmethod
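
The page does not document the internals of RotaryEmbedding2D, rotate_half, or apply_rotary_pos_emb. The following is a minimal sketch of the general idea stated in the module overview (rotary embedding applied separately to the first and second halves of the head dimension), assuming the common "rotate half" formulation. The helper names `rope_1d` and `rope_2d` are hypothetical, and the module's actual frequency tables and tensor layouts may differ.

```python
import torch


def rotate_half(x):
    # Common RoPE helper: (x1, x2) -> (-x2, x1) along the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def rope_1d(x, pos, base=10000.0):
    # x: (tokens, heads, d) with even d; pos: (tokens,) integer positions.
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None].float() * inv_freq[None, :]      # (tokens, d / 2)
    angles = torch.cat((angles, angles), dim=-1)           # (tokens, d)
    cos = angles.cos()[:, None, :]                         # broadcast over heads
    sin = angles.sin()[:, None, :]
    return x * cos + rotate_half(x) * sin


def rope_2d(x, rows, cols):
    # First half of the head dim encodes the row, second half the column.
    x_h, x_w = x.chunk(2, dim=-1)
    return torch.cat((rope_1d(x_h, rows), rope_1d(x_w, cols)), dim=-1)


torch.manual_seed(0)
q = torch.randn(6, 2, 8)                      # 2x3 patch grid, 2 heads, head_dim 8
rows = torch.tensor([0, 0, 0, 1, 1, 1])
cols = torch.tensor([0, 1, 2, 0, 1, 2])
q_rot = rope_2d(q, rows, cols)
```

Because each frequency pair is only rotated, the per-token vector norm is unchanged, and the patch at grid position (0, 0) passes through unmodified.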
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipAttention(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: Module

SigLIP attention projection container for the NaViT subclass.

Keeping the projections directly on this module preserves the expected parameter names (q_proj, k_proj, v_proj, out_proj) for checkpoint loading. Forward is intentionally omitted because the packed NaViT variant is the only runtime path in this tree.

dropout = config.attention_dropout
embed_dim = config.hidden_size
head_dim = self.embed_dim // self.num_heads
k_proj = nn.Linear(self.embed_dim, self.embed_dim)
num_heads = config.num_attention_heads
out_proj = nn.Linear(self.embed_dim, self.embed_dim)
q_proj = nn.Linear(self.embed_dim, self.embed_dim)
scale = self.head_dim ** -0.5
v_proj = nn.Linear(self.embed_dim, self.embed_dim)

class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipEncoder(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: Module

Stack of SigLIP NaViT encoder layers.

layers
nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipEncoder.forward(
inputs_embeds: torch.Tensor,
cu_seqlens: torch.IntTensor,
max_seqlen: int,
cos_h: typing.Optional[torch.Tensor] = None,
sin_h: typing.Optional[torch.Tensor] = None,
cos_w: typing.Optional[torch.Tensor] = None,
sin_w: typing.Optional[torch.Tensor] = None
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipEncoderLayer(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: Module

SigLIP NaViT encoder layer with packed flash attention.

embed_dim = config.hidden_size
layer_norm1
layer_norm2
mlp = SiglipMLP(config)
self_attn = SiglipFlashAttention2(config)

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipEncoderLayer.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.IntTensor,
max_seqlen: int,
cos_h: typing.Optional[torch.Tensor] = None,
sin_h: typing.Optional[torch.Tensor] = None,
cos_w: typing.Optional[torch.Tensor] = None,
sin_w: typing.Optional[torch.Tensor] = None
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipFlashAttention2()

Bases: SiglipAttention

Packed-sequence flash-attention variant with optional 2D RoPE.

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipFlashAttention2.forward(
hidden_states: torch.Tensor,
cu_seqlens: torch.IntTensor,
max_seqlen: int,
cos_h: typing.Optional[torch.Tensor] = None,
sin_h: typing.Optional[torch.Tensor] = None,
cos_w: typing.Optional[torch.Tensor] = None,
sin_w: typing.Optional[torch.Tensor] = None,
**kwargs
) -> torch.Tensor
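
To make the role of cu_seqlens concrete, here is a dense reference for what a variable-length attention call computes over packed tokens: each image attends only to its own patches. This is a slow illustrative sketch in plain PyTorch, not the flash-attention kernel, and `varlen_attention_reference` is a hypothetical name. It assumes q, k, and v are laid out as (total_tokens, heads, head_dim) and uses the head_dim ** -0.5 scale listed on SiglipAttention.

```python
import torch


def varlen_attention_reference(q, k, v, cu_seqlens):
    # q, k, v: (total_tokens, heads, head_dim); cu_seqlens: (num_seqs + 1,)
    out = torch.empty_like(q)
    scale = q.shape[-1] ** -0.5
    bounds = cu_seqlens.tolist()
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Slice one image's tokens and move heads to the front: (heads, n, d).
        qs = q[start:end].transpose(0, 1)
        ks = k[start:end].transpose(0, 1)
        vs = v[start:end].transpose(0, 1)
        attn = torch.softmax(qs @ ks.transpose(-1, -2) * scale, dim=-1)
        out[start:end] = (attn @ vs).transpose(0, 1)
    return out


torch.manual_seed(0)
q, k, v = (torch.randn(5, 2, 4) for _ in range(3))
cu_seqlens = torch.tensor([0, 2, 5], dtype=torch.int32)   # two images: 2 and 3 patches
out = varlen_attention_reference(q, k, v, cu_seqlens)

# Perturbing the second image's values must leave the first image untouched.
v_perturbed = v.clone()
v_perturbed[2:] = torch.randn(3, 2, 4)
out_perturbed = varlen_attention_reference(q, k, v_perturbed, cu_seqlens)
```

The first two output rows are identical in both runs, which is the block-diagonal isolation that packing relies on.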
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipMLP(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: Module

SigLIP vision MLP block used inside the BAGEL NaViT encoder.

activation_fn = ACT2FN[config.hidden_act]
fc1
fc2

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipMLP.forward(
hidden_states: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipPreTrainedModel()

Bases: PreTrainedModel

Abstract weight-init base for SigLIP vision modules.

_no_split_modules = ['SiglipVisionEmbeddings', 'SiglipEncoderLayer']
base_model_prefix = 'siglip'

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipPreTrainedModel._init_weights(
module: torch.nn.Module
) -> None
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig(
hidden_size: int = 768,
intermediate_size: int = 3072,
num_hidden_layers: int = 12,
num_attention_heads: int = 12,
num_channels: int = 3,
image_size: int = 224,
patch_size: int = 16,
hidden_act: str = 'gelu_pytorch_tanh',
layer_norm_eps: float = 1e-06,
attention_dropout: float = 0.0,
rope: bool = True,
**kwargs
)

Bases: _HFSiglipVisionConfig

SigLIP vision config with the NaViT rope flag added.

Mirrors upstream modeling/bagel/siglip_navit.py::SiglipVisionConfig.

model_type = 'siglip_vision_model'

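
As a quick sanity check, the derived attributes listed on SiglipAttention and SiglipVisionEmbeddings can be worked out from the default config values above. The snippet only repeats the arithmetic shown in those attribute listings; `linear_in_features` is an illustrative name for the C*P*P input width of the converted patch embedding.

```python
# Default values from the SiglipVisionConfig signature.
hidden_size = 768
num_attention_heads = 12
num_channels = 3
image_size = 224
patch_size = 16

# SiglipAttention: head_dim = embed_dim // num_heads, scale = head_dim ** -0.5
head_dim = hidden_size // num_attention_heads
scale = head_dim ** -0.5

# SiglipVisionEmbeddings: patches per side and size of the position table
num_patches_per_side = image_size // patch_size
num_patches = num_patches_per_side ** 2

# Width of one pre-patchified row, C * P * P
linear_in_features = num_channels * patch_size * patch_size

print(head_dim, scale, num_patches_per_side, num_patches, linear_in_features)
```

With the defaults this gives a head dimension of 64, a scale of 0.125, a 14 x 14 grid (196 positions), and 768 input features per patch.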
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionEmbeddings(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: Module

NaViT patch embedder.

At construction time, patch_embedding is an nn.Conv2d. The BAGEL model wrapper calls convert_conv2d_to_linear to swap it for an equivalent nn.Linear so that the forward path can consume pre-patchified packed_pixel_values of shape (total_patches, C*P*P). For the released BAGEL-7B-MoT checkpoint, this conversion must happen before the checkpoint is loaded, because the checkpoint already stores the weight in the linear tensor shape.

embed_dim = config.hidden_size
image_size = config.image_size
num_patches = self.num_patches_per_side ** 2
num_patches_per_side = self.image_size // self.patch_size
num_positions = self.num_patches
patch_embedding
patch_size = config.patch_size
position_embedding = nn.Embedding(self.num_positions, self.embed_dim)

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionEmbeddings.convert_conv2d_to_linear(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig,
meta: bool = False
) -> None

Swaps the Conv2d patch embedding, in place, for the mathematically equivalent Linear layer.

Called once, before checkpoint load, by the BAGEL model wrapper. After this runs, patch_embedding expects a 2-D (total_patches, C*P*P) input instead of 4-D (N, C, H, W).
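
The equivalence claimed here can be checked numerically. The sketch below is not the module's implementation: it assumes a convolution whose kernel size and stride both equal the patch size, and it flattens each patch in channel-major (C, P, P) order, which is an illustrative choice. The real method may flatten in a different order, in which case the weight needs a matching permutation before the reshape.

```python
import torch
from torch import nn

torch.manual_seed(0)
C, P, D = 3, 4, 8                      # channels, patch size, embed dim (toy values)

conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
linear = nn.Linear(C * P * P, D)
with torch.no_grad():
    # Each conv filter (C, P, P) becomes one row of the linear weight.
    linear.weight.copy_(conv.weight.reshape(D, C * P * P))
    linear.bias.copy_(conv.bias)

image = torch.randn(1, C, 2 * P, 3 * P)                 # a 2x3 patch grid

# Conv path: (1, D, 2, 3) -> (6, D), patches in row-major grid order.
conv_tokens = conv(image).flatten(2).transpose(1, 2).squeeze(0)

# Linear path: cut the image into patches, then flatten each to C*P*P.
patches = image.unfold(2, P, P).unfold(3, P, P)         # (1, C, 2, 3, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, C * P * P)
linear_tokens = linear(patches)

max_abs_diff = (conv_tokens - linear_tokens).abs().max().item()
```

The two paths agree up to floating-point rounding, which is why the packed forward path can take a 2-D (total_patches, C*P*P) input after the swap.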

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionEmbeddings.forward(
packed_pixel_values: torch.FloatTensor,
packed_flattened_position_ids: torch.LongTensor
) -> torch.Tensor
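
This page does not say how packed_flattened_position_ids is constructed. One plausible convention, assumed here purely for illustration, is row-major indexing into a square position table with num_patches_per_side entries per row; `flattened_position_ids` is a hypothetical helper and the actual data pipeline may differ.

```python
import torch


def flattened_position_ids(h_patches, w_patches, num_patches_per_side):
    # Assumed convention: id = row * num_patches_per_side + col.
    rows = torch.arange(h_patches)
    cols = torch.arange(w_patches)
    return (rows[:, None] * num_patches_per_side + cols[None, :]).flatten()


side = 14                                   # 224 // 16 with the default config
ids = flattened_position_ids(2, 3, side)

# Under this convention the 2D coordinates are recoverable from the flat id.
rows = ids // side
cols = ids % side

# Ids for several images are concatenated in the same order as the patches.
packed_ids = torch.cat([ids, flattened_position_ids(1, 2, side)])
```

Under this assumption, a 2x3 grid maps to ids 0, 1, 2, 14, 15, 16, and the ids restart for each image in the packed batch.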
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionModel(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: SiglipPreTrainedModel

Top-level vision model. Stored at bagel_model.vit_model per BAGEL’s checkpoint layout.

main_input_name = 'packed_pixel_values'
vision_model = SiglipVisionTransformer(config)

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionModel.forward(
packed_pixel_values: torch.Tensor,
packed_flattened_position_ids: torch.LongTensor,
cu_seqlens: torch.IntTensor,
max_seqlen: int
) -> torch.Tensor
nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionModel.get_input_embeddings() -> torch.nn.Module
class nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionTransformer(
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig
)

Bases: Module

BAGEL SigLIP vision transformer over packed patch embeddings.

embeddings = SiglipVisionEmbeddings(config)
encoder = SiglipEncoder(config)
post_layernorm
rope

nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionTransformer.forward(
packed_pixel_values: torch.Tensor,
packed_flattened_position_ids: torch.LongTensor,
cu_seqlens: torch.IntTensor,
max_seqlen: int
) -> torch.Tensor
nemo_automodel.components.models.bagel.modeling_siglip_navit._flash_attn_varlen(
*args,
**kwargs
)
nemo_automodel.components.models.bagel.modeling_siglip_navit.apply_rotary_pos_emb(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor
)
nemo_automodel.components.models.bagel.modeling_siglip_navit.convert_conv2d_to_linear(
vit_model: 'SiglipVisionModel',
config: nemo_automodel.components.models.bagel.modeling_siglip_navit.SiglipVisionConfig,
meta: bool = False
) -> None

Module-level helper mirroring BAGEL’s pretrain_unified_navit.py:525-526.

Equivalent to vit_model.vision_model.embeddings.convert_conv2d_to_linear(config, meta).

nemo_automodel.components.models.bagel.modeling_siglip_navit.rotate_half(
x: torch.Tensor
) -> torch.Tensor
nemo_automodel.components.models.bagel.modeling_siglip_navit.__all__ = ['SiglipVisionConfig', 'RotaryEmbedding2D', 'SiglipVisionEmbeddings', 'SiglipAtt...