bridge.models.stepfun.modelling_step37.model#

Step3.7 multimodal model orchestrator.

Combines a vision tower, a vision-text fusion step, and a Step-3.5 text decoder. The forward path is:

  • Input: forward(input_ids, images: list[ImageForInsert], cu_seqlens, position_ids, attention_mask, labels, loss_mask, packed_seq_params, ...).

  • Vision encode: _encode_images_for_insert(images) runs the PE-G/14 trunk + both downsamplers per :class:ImageForInsert, populating image_features ([N, 169, encoder.output_dim]).

  • Vision-text fusion: :class:ImageInsertEmbedding (owns align_projector: encoder.output_dim → hidden_size) projects the features and scatter-inserts them at each <im_start> (+1) via its insert_features algorithm.

  • The combined embedding is handed to the standard Step-3.5 text decoder via :class:Step37GPTModel.forward(decoder_input=...).

The model consumes list[ImageForInsert] directly; there are no pixel_values / image_grid_thw Qwen-VL-style kwargs.
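The fusion step above can be pictured with a small list-based sketch. It is not the actual insert_features implementation: the token id, the function signature, and the reading of "(+1)" as "features begin one position after <im_start>" are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
IM_START_ID = 7  # hypothetical token id for <im_start>; the real id comes from the tokenizer


def insert_features(token_ids, text_embeddings, image_features, im_start_id=IM_START_ID):
    """Overwrite the rows following each <im_start> with projected image rows.

    token_ids:       list[int] of length T
    text_embeddings: T rows, each a list[float] of width H
    image_features:  one entry per image, each a list of P rows of width H
    """
    starts = [i for i, tok in enumerate(token_ids) if tok == im_start_id]
    if len(starts) != len(image_features):
        raise ValueError("expected one <im_start> per image")

    fused = [list(row) for row in text_embeddings]  # copy so the input is untouched
    for start, feats in zip(starts, image_features):
        first = start + 1  # assumed meaning of "(+1)": features begin right after <im_start>
        last = first + len(feats)
        if last > len(fused):
            raise ValueError("image span runs past the end of the sequence")
        fused[first:last] = [list(row) for row in feats]
    return fused


# Toy run: 3 patch rows per image instead of 169, hidden size 2.
tokens = [1, IM_START_ID, 8, 8, 8, 2]
text = [[0.0, 0.0] for _ in tokens]
image = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
fused = insert_features(tokens, text, [image])
```

In the toy run, rows 2 to 4 of fused carry the image rows while every other row keeps its text embedding.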

Module Contents#

Classes#

Step37Model

Step3.7 multimodal model.

Data#

API#

class bridge.models.stepfun.modelling_step37.model.Step37Model(
language_transformer_config: megatron.bridge.models.stepfun.modelling_step37.transformer_config.Step37TransformerConfig,
language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
vision_transformer_config,
parallel_output: bool = True,
pre_process: bool = True,
post_process: bool = True,
add_encoder: bool = True,
add_decoder: bool = True,
pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
mtp_block_spec: Optional[megatron.core.transformer.spec_utils.ModuleSpec] = None,
vp_stage: Optional[int] = None,
)#

Bases: megatron.core.transformer.MegatronModule

Step3.7 multimodal model.

Parameters:
  • language_transformer_config – Step3.7 TransformerConfig carrying both the Step-3.5 text-decoder fields and the multimodal fields (vision_config, image_token_id, understand_projector_stride, projector_bias).

  • language_transformer_layer_spec – Per-layer ModuleSpec for the text decoder; see modelling_step37/transformer_block.py.

  • vision_transformer_config – HF StepRoboticsVisionEncoderConfig describing the PE-G/14 trunk.

  • parallel_output – forwarded to :class:Step37GPTModel.

  • pre_process / post_process – standard PP-stage flags.

  • add_encoder / add_decoder – PP-stage gating for the vision and language modules. The vision tower is built only when both pre_process and add_encoder are true.

  • pg_collection – process-group bundle (uses MPU defaults if None).

  • mtp_block_spec – optional MTP block spec forwarded to GPTModel.

  • vp_stage – optional virtual-PP stage index.
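As a rough illustration of the gating rule for the vision tower, the sketch below derives per-stage flags for a pipeline and reports which stage would build the tower. StageFlags and flags_for_stage are hypothetical helpers, not part of the module. The mapping of pre_process to the first pipeline stage and post_process to the last follows the usual Megatron convention and is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageFlags:
    """Hypothetical container for the constructor flags of one PP stage."""

    pre_process: bool
    post_process: bool
    add_encoder: bool = True

    @property
    def builds_vision_tower(self) -> bool:
        # Rule stated in the parameter list: both flags must be true.
        return self.pre_process and self.add_encoder


def flags_for_stage(pp_rank: int, pp_size: int, add_encoder: bool = True) -> StageFlags:
    # Assumed convention: first stage embeds inputs, last stage computes the output.
    return StageFlags(
        pre_process=(pp_rank == 0),
        post_process=(pp_rank == pp_size - 1),
        add_encoder=add_encoder,
    )


tower_per_stage = [flags_for_stage(rank, 4).builds_vision_tower for rank in range(4)]
# → [True, False, False, False]
```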

Initialization

shared_embedding_or_output_weight()#
property decoder#
set_input_tensor(input_tensor) None#

Standard PP plumbing: the tensor is treated as encoder_hidden_state on pre_process ranks and is otherwise forwarded to the language model.

freeze(
freeze_language_model: bool,
freeze_vision_model: bool,
freeze_vision_projection: bool,
)#

Freeze any combination of the language tower / vision tower / projector for fine-tuning scenarios.
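A minimal sketch of what selective freezing usually amounts to: clearing requires_grad on every parameter of the chosen submodules. Param and Module below are stand-ins for torch.nn.Parameter and torch.nn.Module so the example runs without PyTorch. freeze_modules and its argument layout are illustrative assumptions, not the method's actual body.

```python
class Param:
    """Stand-in for torch.nn.Parameter: only the requires_grad flag matters here."""

    def __init__(self):
        self.requires_grad = True


class Module:
    """Stand-in for torch.nn.Module exposing parameters()."""

    def __init__(self, n_params):
        self._params = [Param() for _ in range(n_params)]

    def parameters(self):
        return iter(self._params)


def freeze_modules(language_model, vision_model, vision_projection, *,
                   freeze_language_model, freeze_vision_model, freeze_vision_projection):
    candidates = [
        (freeze_language_model, language_model),
        (freeze_vision_model, vision_model),
        (freeze_vision_projection, vision_projection),
    ]
    for should_freeze, module in candidates:
        # A module may be None on pipeline stages where it was never built.
        if should_freeze and module is not None:
            for param in module.parameters():
                param.requires_grad = False


# Projector-only fine-tuning: freeze both towers, keep the projector trainable.
lm, vm, proj = Module(3), Module(2), Module(1)
freeze_modules(lm, vm, proj,
               freeze_language_model=True,
               freeze_vision_model=True,
               freeze_vision_projection=False)
```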

_encode_images_for_insert(
images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]],
) Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]]#

Encode raw image pixels into vision features.

For each :class:ImageForInsert in images, runs the vision tower on its raw [N, 3, H, W] pixels (if image_features isn't already populated) and returns a new ImageForInsert carrying the encoded [N, P, encoder.output_dim] features. The insert_start_token + RoPE metadata is preserved.

Vision runs in the same mesh as the decoder, so this is a single self.vision_model(pixels) call.
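The encode-if-missing behaviour can be sketched with a frozen dataclass and dataclasses.replace. ImageForInsertSketch is a hypothetical stand-in: only image_features and insert_start_token are named in this page, while pixels and rope_metadata are assumed field names. Returning already-encoded items unchanged is also an assumption.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ImageForInsertSketch:
    pixels: Any                           # assumed field: raw [N, 3, H, W] pixels
    insert_start_token: int               # placeholder token id, e.g. <im_start>
    rope_metadata: Any = None             # assumed field for the RoPE metadata
    image_features: Optional[Any] = None  # [N, P, encoder.output_dim] once encoded


def encode_images_for_insert(
    images: Optional[list],
    vision_model: Callable[[Any], Any],
) -> Optional[list]:
    if images is None:
        return None
    encoded = []
    for image in images:
        if image.image_features is not None:
            encoded.append(image)  # already encoded; skip the vision tower
            continue
        # replace() copies every other field, so the metadata is preserved.
        encoded.append(replace(image, image_features=vision_model(image.pixels)))
    return encoded


def fake_vision_model(pixels):
    # Toy callable standing in for the vision tower.
    return f"features({pixels})"


raw = ImageForInsertSketch(pixels="img0", insert_start_token=7, rope_metadata="meta")
cached = ImageForInsertSketch(pixels="img1", insert_start_token=7, image_features="cached")
result = encode_images_for_insert([raw, cached], fake_vision_model)
```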

forward_head(
input_ids: torch.Tensor,
images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None,
position_ids: Optional[torch.Tensor] = None,
**kwargs,
) torch.Tensor#

Compute the fused vision-text input embedding.

Delegates to :class:ImageInsertEmbedding to compute word-embedding + align_projector + insert_features scatter. Returns the fused [S, B, H] embedding ready for the decoder.

forward(
input_ids: torch.Tensor,
images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None,
cu_seqlens: Optional[torch.Tensor] = None,
position_ids: Optional[torch.Tensor] = None,
attention_mask: Optional[torch.Tensor] = None,
labels: Optional[torch.Tensor] = None,
loss_mask: Optional[torch.Tensor] = None,
max_seq_len: Optional[torch.Tensor] = None,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
inference_params: Optional[megatron.core.InferenceParams] = None,
extra_block_kwargs: Optional[dict] = None,
inference_context: object | None = None,
runtime_gather_output: bool | None = None,
**kwargs,
) torch.Tensor#

Step3.7 forward.

Parameters:
  • input_ids – [1, T] packed token ids (the flickr8k pipeline always feeds B=1 because per-pack sub-sequences are demarcated by cu_seqlens).

  • images – pre-encoded or raw list[ImageForInsert]. Each item's insert_start_token points at the placeholder token id (e.g. <im_start>) used by insert_features to locate the 169-token <im_patch> span.

  • cu_seqlens – [B_sub+1] int32 sub-sequence boundary array inside the packed row.

  • position_ids – optional per-sub-seq position ids (None lets the decoder layer's RoPE module compute them internally).

  • packed_seq_params – pre-built FlashAttn varlen PackedSeqParams; the flickr8k forward step builds these from cu_seqlens.

  • attention_mask / labels / loss_mask / max_seq_len – standard multimodal SFT batch fields.
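To make the packed-row layout concrete, the sketch below builds a cu_seqlens boundary array from sub-sequence lengths and derives per-sub-sequence position ids that restart at each boundary. Both helper names are hypothetical and plain lists stand in for int32 tensors. The real pipeline may construct these differently.

```python
from itertools import accumulate


def build_cu_seqlens(sub_seq_lengths):
    """Cumulative boundaries: B_sub lengths give B_sub + 1 offsets, starting at 0."""
    return [0, *accumulate(sub_seq_lengths)]


def position_ids_from_cu_seqlens(cu_seqlens):
    """Position ids that restart from 0 at every sub-sequence boundary."""
    position_ids = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        position_ids.extend(range(end - start))
    return position_ids


# Three sub-sequences of lengths 3, 2 and 4 packed into one row of T = 9 tokens.
cu_seqlens = build_cu_seqlens([3, 2, 4])
# → [0, 3, 5, 9]
position_ids = position_ids_from_cu_seqlens(cu_seqlens)
# → [0, 1, 2, 0, 1, 0, 1, 2, 3]
longest = max(end - start for start, end in zip(cu_seqlens, cu_seqlens[1:]))
# → 4
```

The last boundary equals the packed length T, and the longest sub-sequence length is the kind of value a max_seq_len field would typically carry.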

bridge.models.stepfun.modelling_step37.model.__all__#

['Step37Model']