bridge.models.stepfun.modelling_step37.image_insert_embedding#
Image-insert word embedding for Step3.7.
Defines:
- class ImageForInsert: the dataclass that represents one "image insertion group" travelling from data preprocessing to the model forward pass. It lives on the model side (rather than the data side) because it is fundamentally part of the Step37Model.forward input contract; the data subpackage re-exports it via data.vlm_datasets.step37_flickr8k.multimodal_utils for backward compatibility.
- class ImageInsertEmbedding: owns align_projector (nn.Linear(encoder.output_dim, hidden_size)) and provides insert_features, which finds each <im_start> in input_ids, offsets by +1 to the first <im_patch>, and overwrites input_embeddings[start:start+L] in place with the projected feature rows.
ImageInsertEmbedding borrows (does not own):
- language_embedding: a reference to Step37Model.language_model.embedding (Megatron-Core LanguageModelEmbedding). Stored via object.__setattr__ to bypass nn.Module's auto-registration so the same Parameter tensor isn't counted twice in parameters() / state_dict().
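The borrow-without-owning pattern described above can be illustrated with plain PyTorch modules. This is a minimal sketch, not the library's code: the class name Borrower and the use of nn.Embedding as the shared embedding are assumptions made for the demonstration.

```python
import torch
from torch import nn


class Borrower(nn.Module):
    """Hypothetical stand-in for a module that borrows an embedding it does not own."""

    def __init__(self, shared_embedding: nn.Module, in_dim: int, hidden: int):
        super().__init__()
        # Owned submodule: assigned normally, so nn.Module registers it.
        self.align_projector = nn.Linear(in_dim, hidden, bias=False)
        # Borrowed submodule: object.__setattr__ skips nn.Module.__setattr__,
        # so the reference lands in the instance __dict__, not in _modules.
        object.__setattr__(self, "language_embedding", shared_embedding)


embedding = nn.Embedding(16, 8)  # assumed stand-in for the shared embedding
borrower = Borrower(embedding, in_dim=4, hidden=8)

# Only the owned projector shows up; the borrowed embedding is still reachable.
owned_names = [name for name, _ in borrower.named_parameters()]
```

Because the reference is stored as an ordinary attribute, borrower.language_embedding still resolves to the same object, but the embedding weight is reported only by its real owner.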
Output shape: [S, B, H] (sequence-first), matching Megatron-Core
LanguageModelEmbedding.forward.
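To make the sequence-first layout concrete, the following sketch converts batch-first embeddings to [S, B, H]. It uses nn.Embedding as an assumed stand-in for the Megatron-Core embedding, and the sizes are arbitrary.

```python
import torch
from torch import nn

B, S, H = 2, 5, 8
word_embeddings = nn.Embedding(32, H)  # stand-in, not the Megatron-Core class
input_ids = torch.randint(0, 32, (B, S))  # [B, S] token ids

batch_first = word_embeddings(input_ids)  # [B, S, H]
# Swap the batch and sequence axes to obtain the sequence-first layout.
sequence_first = batch_first.transpose(0, 1).contiguous()  # [S, B, H]
```

In this layout, sequence_first[s, b] holds the embedding of token s in sample b, which is why later slicing runs along dimension 0.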
This module performs the vision-text fusion step. The caller
(Step37Model.forward_head) supplies a pre-encoded
list[ImageForInsert] (with image_features populated by the vision
tower); this module projects and scatter-inserts them into the text
embedding and returns the fused tensor to be fed as the decoder_input of
the GPTModel.
Module Contents#
Classes#
ImageForInsert | Language-model insert payload for image / multicrop-patch features.
ImageInsertEmbedding | Word embedding + image-feature projection + <im_start> scatter-insert.
Data#
API#
- bridge.models.stepfun.modelling_step37.image_insert_embedding.logger#
‘getLogger(…)’
- class bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert#
Language-model insert payload for image / multicrop-patch features.
Lives on the model side (not the data side) because it's part of the Step37Model.forward input contract; the data subpackage re-exports it for downstream data-side imports.
Attributes:
insert_start_token: Token id after which the visual features are inserted (<im_start> for image, <patch_start> for multicrop patches).
images: Raw image tensor shaped [N, 3, H, W]. Either this or image_features is populated; the encoder pipeline consumes images and populates image_features.
image_features: Optional precomputed features before the language projector ([N, L, C]). Used for the decoupled-encoder mode where the vision tower runs outside the decoder.
rope_cu_seqlens: Per-image patch cu_seqlens for visual RoPE (shape [N + 1]).
rope_max_seq_len: Max patch count across all images in this ImageForInsert (a Python int for serializability).
- insert_start_token: int#
None
- images: Optional[torch.Tensor]#
None
- image_features: Optional[Union[torch.Tensor, list[torch.Tensor]]]#
None
- rope_cu_seqlens: Optional[torch.Tensor]#
None
- rope_max_seq_len: Optional[int]#
None
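The field list above can be mirrored in a small dataclass to show how the pieces relate. This is a sketch under stated assumptions: ImageForInsertSketch is a hypothetical stand-in for the real class, the token id 101 is made up, and the cumulative-sum layout of rope_cu_seqlens (leading zero, one entry per image boundary) is an assumed convention consistent with the documented [N + 1] shape.

```python
from dataclasses import dataclass
from typing import Optional, Union

import torch


@dataclass
class ImageForInsertSketch:
    """Illustrative mirror of the documented fields; not the library class."""

    insert_start_token: Optional[int] = None
    images: Optional[torch.Tensor] = None
    image_features: Optional[Union[torch.Tensor, list[torch.Tensor]]] = None
    rope_cu_seqlens: Optional[torch.Tensor] = None
    rope_max_seq_len: Optional[int] = None


# Three images with different patch counts (made-up numbers).
patch_counts = [4, 9, 6]
# Assumed convention: cumulative patch offsets with a leading zero, shape [N + 1].
cu_seqlens = torch.tensor([0] + patch_counts).cumsum(0).to(torch.int32)

item = ImageForInsertSketch(
    insert_start_token=101,  # hypothetical <im_start> id
    image_features=[torch.zeros(n, 16) for n in patch_counts],
    rope_cu_seqlens=cu_seqlens,
    rope_max_seq_len=max(patch_counts),  # plain Python int
)
```

Here images stays None because the features are already populated, matching the "either this or image_features" rule in the attribute description.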
- class bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageInsertEmbedding(
- language_embedding,
- encoder_output_dim: int,
- hidden_size: int,
- projector_bias: bool = False,
)
Bases: torch.nn.Module
Word embedding + image-feature projection + <im_start> scatter-insert.
Initialization
- static _normalize_feature_list(
- image_features: Union[torch.Tensor, list[torch.Tensor]],
- *,
- device: torch.device,
- dtype: torch.dtype,
)
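The reference gives only the signature of _normalize_feature_list, so its behavior is not documented here. One plausible reading, offered purely as an assumption, is that it turns either accepted input form into a list of per-image tensors on the requested device and dtype. The function name below is hypothetical.

```python
import torch


def normalize_feature_list_sketch(image_features, *, device, dtype):
    """Assumed behavior: return a list of per-image tensors on device/dtype."""
    if isinstance(image_features, torch.Tensor):
        # unbind(0) splits a stacked [N, L, C] tensor into N tensors of shape [L, C].
        image_features = list(image_features.unbind(0))
    return [feature.to(device=device, dtype=dtype) for feature in image_features]


cpu = torch.device("cpu")
from_tensor = normalize_feature_list_sketch(
    torch.zeros(2, 3, 4), device=cpu, dtype=torch.float16
)
from_list = normalize_feature_list_sketch(
    [torch.zeros(3, 4), torch.zeros(5, 4)], device=cpu, dtype=torch.float16
)
```

A list-based result would let images with different patch counts coexist, which a single stacked tensor cannot represent.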
- static insert_features(
- input_embeddings: torch.Tensor,
- image_features: Union[torch.Tensor, list[torch.Tensor]],
- input_ids: torch.Tensor,
- flag: int,
)
Scatter-insert projected image features at <im_start> positions.
Finds every position where input_ids == flag (the <im_start> id), shifts by +1 to land on the first <im_patch> placeholder, and overwrites the next feature.shape[0] rows of input_embeddings (sequence-first [S, B, H]) with the provided image-feature rows.
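The scatter-insert described above can be sketched in a few lines. This is an illustration rather than the library's implementation: the function name is hypothetical, and it assumes that features are supplied in the same row-major (batch, then sequence) order in which the flag tokens appear, with enough placeholder rows after each flag.

```python
import torch


def insert_features_sketch(input_embeddings, image_features, input_ids, flag):
    """Overwrite placeholder rows after each flag token, in place."""
    # nonzero() returns (batch, seq) index pairs in row-major order.
    positions = (input_ids == flag).nonzero().tolist()
    assert len(positions) == len(image_features), "one feature block per flag"
    for (b, s), feature in zip(positions, image_features):
        start = s + 1  # first placeholder sits right after the flag token
        # input_embeddings is [S, B, H], so slice the sequence axis for sample b.
        input_embeddings[start:start + feature.shape[0], b] = feature
    return input_embeddings


flag = 9  # hypothetical <im_start> id
input_ids = torch.tensor([[5, 9, 0, 0, 0, 7]])  # [B=1, S=6]
embeddings = torch.zeros(6, 1, 2)               # [S, B, H]
features = [torch.ones(3, 2)]                   # one image, L=3 rows

fused = insert_features_sketch(embeddings, features, input_ids, flag)
```

In the toy input, the flag sits at sequence index 1, so rows 2 through 4 are replaced while the flag row and the trailing text row keep their original values.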
- forward(
- input_ids: torch.IntTensor,
- images: Optional[list[bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None,
- position_ids: Optional[torch.Tensor] = None,
- **kwargs,
)
Compute word embeddings and scatter-insert pre-encoded image features.
- Parameters:
input_ids – [B, S] long token ids (placeholder positions begin immediately after the insert_start_token of each ImageForInsert, typically <im_start>).
images – list of ImageForInsert. Each item must have image_features populated (the vision tower runs upstream, e.g. inside Step37Model._encode_images_for_insert).
position_ids – forwarded to the underlying word embedding (None is accepted; Step-3.5's per-layer rotary is computed inside the decoder, so the position arg here is normally ignored).
- Returns:
Fused embedding [S, B, H] (sequence-first), ready to feed into the GPT decoder via decoder_input.
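The forward pass documented above chains three steps: embed, project, insert. The toy module below composes them end to end as a sketch only. FusionSketch is a hypothetical name, nn.Embedding stands in for the Megatron-Core embedding, and the method takes a feature tensor and a flag id directly instead of a list[ImageForInsert].

```python
import torch
from torch import nn


class FusionSketch(nn.Module):
    """Toy composition of embed, project, and scatter-insert; not the library class."""

    def __init__(self, vocab_size: int, encoder_output_dim: int, hidden_size: int):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.align_projector = nn.Linear(encoder_output_dim, hidden_size, bias=False)

    def forward(self, input_ids, image_features, flag):
        # [B, S] -> [B, S, H] -> [S, B, H]
        fused = self.word_embeddings(input_ids).transpose(0, 1).contiguous()
        projected = self.align_projector(image_features)  # [N, L, C] -> [N, L, H]
        positions = (input_ids == flag).nonzero().tolist()  # (batch, seq) pairs
        for (b, s), rows in zip(positions, projected):
            fused[s + 1:s + 1 + rows.shape[0], b] = rows
        return fused


torch.manual_seed(0)
model = FusionSketch(vocab_size=16, encoder_output_dim=4, hidden_size=8)
input_ids = torch.tensor([[1, 9, 0, 0, 2]])  # 9 is a hypothetical <im_start> id
image_features = torch.randn(1, 2, 4)        # N=1 image, L=2 patches, C=4
decoder_input = model(input_ids, image_features, flag=9)  # [S=5, B=1, H=8]
```

The returned tensor keeps ordinary word embeddings at text positions and carries projected image rows at the two placeholder positions after the flag.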
- bridge.models.stepfun.modelling_step37.image_insert_embedding.__all__#
[‘ImageForInsert’, ‘ImageInsertEmbedding’]