bridge.models.stepfun.modelling_step37.image_insert_embedding#

Image-insert word embedding for Step3.7.

Defines:

  • ImageForInsert: the dataclass that represents one “image insertion group” travelling from data preprocessing to the model forward pass. It lives on the model side (rather than the data side) because it is fundamentally part of the Step37Model.forward input contract; the data subpackage re-exports it via data.vlm_datasets.step37_flickr8k.multimodal_utils for backward compatibility.

  • ImageInsertEmbedding: owns align_projector (nn.Linear(encoder.output_dim, hidden_size)) and provides insert_features, which finds each <im_start> in input_ids, offsets by +1 to reach the first <im_patch>, and overwrites input_embeddings[start:start+L] in place with the projected feature rows.

ImageInsertEmbedding borrows (does not own):

  • language_embedding — a reference to Step37Model.language_model.embedding (Megatron-Core LanguageModelEmbedding). Stored via object.__setattr__ to bypass nn.Module’s auto-registration so the same Parameter tensor isn’t counted twice in parameters() / state_dict().
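
The borrowing pattern described above can be illustrated with a minimal sketch. The `Owner` and `Borrower` classes are hypothetical stand-ins (not the real Step37Model or ImageInsertEmbedding), and `nn.Embedding` stands in for the Megatron-Core embedding; only the `object.__setattr__` mechanism is taken from the text.

```python
import torch
from torch import nn


class Owner(nn.Module):
    """Hypothetical stand-in for the model that owns the word embedding."""

    def __init__(self, vocab_size: int = 8, hidden_size: int = 4) -> None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)


class Borrower(nn.Module):
    """Hypothetical stand-in for a module that borrows the embedding."""

    def __init__(self, language_embedding: nn.Module, in_dim: int = 6, hidden_size: int = 4) -> None:
        super().__init__()
        # Owned submodule: registered normally through nn.Module.__setattr__.
        self.align_projector = nn.Linear(in_dim, hidden_size, bias=False)
        # Borrowed submodule: written straight into the instance __dict__, so it
        # is not added to self._modules and stays out of parameters()/state_dict().
        object.__setattr__(self, "language_embedding", language_embedding)


owner = Owner()
borrower = Borrower(owner.embedding)

borrowed_keys = list(borrower.state_dict().keys())
shares_tensor = borrower.language_embedding.weight is owner.embedding.weight
```

Here `borrowed_keys` contains only the projector weight, while `shares_tensor` confirms that the borrower still reaches the same Parameter object through plain attribute access.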

Output shape: [S, B, H] (sequence-first), matching Megatron-Core LanguageModelEmbedding.forward.

This module performs the vision-text fusion step. The caller (Step37Model.forward_head) supplies a pre-encoded list[ImageForInsert] (with image_features populated by the vision tower); this module projects the features, scatter-inserts them into the text embedding, and returns the fused tensor to be fed to the GPTModel as decoder_input.

Module Contents#

Classes#

ImageForInsert

Language-model insert payload for image / multicrop-patch features.

ImageInsertEmbedding

Word embedding + image-feature projection + <im_start> scatter-insert.

Data#

API#

bridge.models.stepfun.modelling_step37.image_insert_embedding.logger#

‘getLogger(…)’

class bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert#

Language-model insert payload for image / multicrop-patch features.

Lives on the model side (not the data side) because it’s part of the Step37Model.forward input contract — the data subpackage re-exports it for downstream data-side imports.

.. attribute:: insert_start_token

Token id after which the visual features are inserted (<im_start> for image, <patch_start> for multicrop patches).

.. attribute:: images

Raw image tensor shaped [N, 3, H, W]. Either this or image_features is populated; the encoder pipeline consumes images and populates image_features.

.. attribute:: image_features

Optional precomputed features before the language projector ([N, L, C]). Used for the decoupled-encoder mode where the vision tower runs outside the decoder.

.. attribute:: rope_cu_seqlens

Per-image patch cu_seqlens for visual RoPE (shape [N + 1]).

.. attribute:: rope_max_seq_len

Max patch count across all images in this ImageForInsert (a Python int for serializability).

insert_start_token: int#

None

images: Optional[torch.Tensor]#

None

image_features: Optional[Union[torch.Tensor, list[torch.Tensor]]]#

None

rope_cu_seqlens: Optional[torch.Tensor]#

None

rope_max_seq_len: Optional[int]#

None
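
As an illustration of how the two RoPE fields relate to per-image patch counts, here is a small sketch. It assumes that `rope_cu_seqlens` follows the common cumulative-boundary convention (a leading 0 followed by running totals), which matches the documented [N + 1] shape but is not confirmed by this page. The helper name `build_rope_metadata` is hypothetical, and plain Python lists are used where the real field holds a tensor.

```python
from itertools import accumulate


def build_rope_metadata(patch_counts: list[int]) -> tuple[list[int], int]:
    """Return (cu_seqlens, max_seq_len) for a list of per-image patch counts."""
    # Assumed convention: image i owns patches [cu_seqlens[i], cu_seqlens[i + 1]).
    cu_seqlens = [0, *accumulate(patch_counts)]
    # Largest patch count across the group, kept as a plain Python int.
    max_seq_len = max(patch_counts, default=0)
    return cu_seqlens, max_seq_len


# Three images with 4, 9 and 6 patches respectively.
cu_seqlens, max_seq_len = build_rope_metadata([4, 9, 6])
# cu_seqlens == [0, 4, 13, 19]; max_seq_len == 9
```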

class bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageInsertEmbedding(
language_embedding,
encoder_output_dim: int,
hidden_size: int,
projector_bias: bool = False,
)#

Bases: torch.nn.Module

Word embedding + image-feature projection + <im_start> scatter-insert.

Initialization

static _normalize_feature_list(
image_features: Union[torch.Tensor, list[torch.Tensor]],
*,
device: torch.device,
dtype: torch.dtype,
) list[torch.Tensor]#

static insert_features(
input_embeddings: torch.Tensor,
image_features: Union[torch.Tensor, list[torch.Tensor]],
input_ids: torch.Tensor,
flag: int,
) torch.Tensor#

Scatter-insert projected image features at <im_start> positions.

Finds every position where input_ids == flag (the <im_start> id), shifts by +1 to land on the first <im_patch> placeholder, and overwrites the next feature.shape[0] rows of input_embeddings (sequence-first [S, B, H]) with the provided image-feature rows.
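
The scatter-insert described above can be sketched as follows. This is an illustrative approximation rather than the module's actual code: it assumes the features are already projected to width H, that one feature tensor is supplied per start-token occurrence, and that occurrences are matched in (batch, position) order. The function name `insert_features_sketch` and the token ids in the usage are hypothetical.

```python
import torch


def insert_features_sketch(
    input_embeddings: torch.Tensor,       # [S, B, H], sequence-first
    image_features: list[torch.Tensor],   # each [L_i, H], assumed already projected
    input_ids: torch.Tensor,              # [B, S]
    flag: int,                            # id of the start token
) -> torch.Tensor:
    # Locate every start token; torch.nonzero returns indices in row-major order.
    batch_idx, seq_idx = torch.nonzero(input_ids == flag, as_tuple=True)
    if batch_idx.numel() != len(image_features):
        raise ValueError("expected one feature tensor per start token")
    for b, s, feat in zip(batch_idx.tolist(), seq_idx.tolist(), image_features):
        start = s + 1                     # first placeholder after the start token
        length = feat.shape[0]
        # Overwrite the placeholder rows in place for this batch column.
        input_embeddings[start:start + length, b] = feat.to(input_embeddings.dtype)
    return input_embeddings


# Usage: S=6, B=1, H=2; id 1 plays <im_start>, id 0 plays <im_patch>.
ids = torch.tensor([[5, 1, 0, 0, 2, 5]])
emb = torch.zeros(6, 1, 2)
fused = insert_features_sketch(emb, [torch.full((2, 2), 7.0)], ids, flag=1)
```

After the call, rows 2 and 3 of `fused` hold the feature values and every other row is unchanged. Because the write is in place, `fused` and `emb` refer to the same tensor.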

forward(
input_ids: torch.IntTensor,
images: Optional[list[bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None,
position_ids: Optional[torch.Tensor] = None,
**kwargs,
) torch.Tensor#

Compute word embeddings and scatter-insert pre-encoded image features.

Parameters:
  • input_ids – [B, S] long token ids (placeholder positions live at the insert_start_token of each ImageForInsert, typically <im_start>).

  • images – list of ImageForInsert. Each item must have image_features populated (the vision tower runs upstream, e.g. inside Step37Model._encode_images_for_insert).

  • position_ids – forwarded to the underlying word embedding (None is accepted; Step-3.5’s per-layer rotary is computed inside the decoder, so the position arg here is normally ignored).

Returns:

Fused embedding [S, B, H] (sequence-first), ready to feed into the GPT decoder via decoder_input.
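
To make the shape contract concrete, the sketch below walks through the same three stages (embed, project, overwrite) with stand-in components. It is a rough illustration only: `nn.Embedding` replaces the Megatron-Core embedding, the dimensions and token ids are hypothetical, and `torch.no_grad()` is used purely to keep the example inference-style.

```python
import torch
from torch import nn

torch.manual_seed(0)
VOCAB, HIDDEN, ENC_DIM = 16, 8, 12        # hypothetical sizes
IM_START, IM_PATCH = 1, 2                 # hypothetical token ids

word_embedding = nn.Embedding(VOCAB, HIDDEN)              # stand-in word embedding
align_projector = nn.Linear(ENC_DIM, HIDDEN, bias=False)  # encoder dim -> hidden size

input_ids = torch.tensor([[5, IM_START, IM_PATCH, IM_PATCH, IM_PATCH, 6]])  # [B=1, S=6]
image_features = torch.randn(1, 3, ENC_DIM)               # [N=1, L=3, C]

with torch.no_grad():
    # 1. Text embedding [B, S, H], transposed to sequence-first [S, B, H].
    fused = word_embedding(input_ids).transpose(0, 1).contiguous()
    # 2. Project the encoder features into the language hidden size: [N, L, H].
    projected = align_projector(image_features)
    # 3. Overwrite the placeholder rows that follow the start token.
    b, s = torch.nonzero(input_ids == IM_START)[0].tolist()
    fused[s + 1 : s + 1 + projected.shape[1], b] = projected[0]
```

The result keeps the text rows at positions 0, 1 and 5, and carries the projected image rows at positions 2 to 4.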

bridge.models.stepfun.modelling_step37.image_insert_embedding.__all__#

[‘ImageForInsert’, ‘ImageInsertEmbedding’]