`bridge.models.stepfun.modelling_step37.model`

Step3.7 multimodal model orchestrator.

Combines a vision tower, a vision-text fusion step, and a Step-3.5 text decoder. The forward path is:

1. Input: forward(input_ids, images: list[ImageForInsert], cu_seqlens, position_ids, attention_mask, labels, loss_mask, packed_seq_params, ...).
2. Vision encode: _encode_images_for_insert(images) runs the PE-G/14 trunk and both downsamplers on each ImageForInsert, populating image_features ([N, 169, encoder.output_dim]).
3. Vision-text fusion: ImageInsertEmbedding, which owns align_projector (encoder.output_dim → hidden_size), projects the features and scatter-inserts them starting at the position after each <im_start> token (index + 1) via its insert_features algorithm.
4. Decode: the combined embedding is handed to the standard Step-3.5 text decoder via Step37GPTModel.forward(decoder_input=...).

The model consumes list[ImageForInsert] directly; there are no pixel_values / image_grid_thw Qwen-VL-style kwargs.
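The fusion step can be illustrated with a toy sketch. It is not the library's insert_features implementation: it assumes, purely for illustration, that each <im_start> token is followed by P placeholder positions that get overwritten with the projected feature rows. The token id `IM_START_ID`, the helper name `scatter_insert`, and the use of plain Python lists instead of tensors are all hypothetical.

```python
from typing import List, Sequence

IM_START_ID = 9001  # hypothetical token id, not the real vocabulary entry


def scatter_insert(
    input_ids: Sequence[int],
    text_embeds: List[List[float]],
    image_feats: List[List[List[float]]],
) -> List[List[float]]:
    """Overwrite the rows after each <im_start> with that image's features.

    text_embeds has one row per token; image_feats[i] holds the P projected
    rows for the i-th image, in order of appearance.
    """
    fused = [row[:] for row in text_embeds]  # copy so the input stays intact
    starts = [i for i, tok in enumerate(input_ids) if tok == IM_START_ID]
    if len(starts) != len(image_feats):
        raise ValueError("number of <im_start> tokens must match number of images")
    for start, feats in zip(starts, image_feats):
        first = start + 1  # features begin one position after <im_start>
        if first + len(feats) > len(fused):
            raise ValueError("image features run past the end of the sequence")
        for offset, row in enumerate(feats):
            fused[first + offset] = list(row)
    return fused


# One image with P = 2 feature rows, inserted after the <im_start> at index 1.
ids = [1, IM_START_ID, 0, 0, 2]
text = [[float(t)] * 2 for t in ids]
fused = scatter_insert(ids, text, [[[0.5, 0.5], [0.25, 0.25]]])
```

In this sketch the <im_start> row itself keeps its text embedding; only the rows that follow it are replaced.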

Module Contents

Classes

Step37Model

Step3.7 multimodal model.

Data

__all__

API

class bridge.models.stepfun.modelling_step37.model.Step37Model(language_transformer_config: megatron.bridge.models.stepfun.modelling_step37.transformer_config.Step37TransformerConfig, language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec, vision_transformer_config, parallel_output: bool = True, pre_process: bool = True, post_process: bool = True, add_encoder: bool = True, add_decoder: bool = True, pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None, mtp_block_spec: Optional[megatron.core.transformer.spec_utils.ModuleSpec] = None, vp_stage: Optional[int] = None)

Bases: megatron.core.transformer.MegatronModule

Step3.7 multimodal model.

Parameters:

language_transformer_config – Step3.7 TransformerConfig carrying both the Step-3.5 text-decoder fields and the multimodal fields (vision_config, image_token_id, understand_projector_stride, projector_bias).
language_transformer_layer_spec – Per-layer ModuleSpec for the text decoder — see modelling_step37/transformer_block.py.
vision_transformer_config – HF StepRoboticsVisionEncoderConfig describing the PE-G/14 trunk.
parallel_output – forwarded to Step37GPTModel.
pre_process / post_process – standard PP-stage flags.
add_encoder / add_decoder – PP-stage gating for the vision and language modules. The vision tower is built only when both pre_process and add_encoder are true.
pg_collection – process-group bundle (uses MPU defaults if None).
mtp_block_spec – optional MTP block spec forwarded to GPTModel.
vp_stage – optional virtual-PP stage index.
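The gating rule for the vision tower can be written as a small predicate. This is a sketch of the rule stated in the parameter list, not the constructor's actual code; `StageFlags` and `builds_vision_tower` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageFlags:
    """Hypothetical bundle of the pipeline-stage flags passed to the model."""

    pre_process: bool = True
    post_process: bool = True
    add_encoder: bool = True
    add_decoder: bool = True


def builds_vision_tower(flags: StageFlags) -> bool:
    # Per the parameter description: both flags must be true.
    return flags.pre_process and flags.add_encoder


first_stage = StageFlags()                   # defaults: tower is built
later_stage = StageFlags(pre_process=False)  # not a pre_process rank: no tower
```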


shared_embedding_or_output_weight()

property decoder

set_input_tensor(input_tensor) → None

Standard pipeline-parallel (PP) plumbing: the input tensor is treated as encoder_hidden_state on pre_process ranks and is otherwise forwarded to the language model.

freeze(freeze_language_model: bool, freeze_vision_model: bool, freeze_vision_projection: bool)

Freeze any combination of the language model, vision tower, and vision projector for fine-tuning scenarios.
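A common way to implement this kind of selective freezing in PyTorch is to clear `requires_grad` on the parameters of the chosen submodules. The sketch below assumes that approach. The `ToyModel` class and its attribute names (`language_model`, `vision_model`, `vision_projection`) are stand-ins chosen for illustration and may not match the real attributes of Step37Model.

```python
import torch.nn as nn


class ToyModel(nn.Module):
    """Stand-in with three submodules; the attribute names are assumptions."""

    def __init__(self) -> None:
        super().__init__()
        self.language_model = nn.Linear(4, 4)
        self.vision_model = nn.Linear(4, 4)
        self.vision_projection = nn.Linear(4, 4)

    def freeze(
        self,
        freeze_language_model: bool,
        freeze_vision_model: bool,
        freeze_vision_projection: bool,
    ) -> None:
        selected = []
        if freeze_language_model:
            selected.append(self.language_model)
        if freeze_vision_model:
            selected.append(self.vision_model)
        if freeze_vision_projection:
            selected.append(self.vision_projection)
        for module in selected:
            for param in module.parameters():
                param.requires_grad = False  # excluded from gradient updates


model = ToyModel()
# Train only the projector: freeze both towers.
model.freeze(
    freeze_language_model=True,
    freeze_vision_model=True,
    freeze_vision_projection=False,
)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

After this call, only the projector's weight and bias remain trainable in the toy model.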

_encode_images_for_insert(images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]]) → Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]]

Encode raw image pixels into vision features.

For each ImageForInsert in images, runs the vision tower on its raw [N, 3, H, W] pixels (unless image_features is already populated) and returns a new ImageForInsert carrying the encoded [N, P, encoder.output_dim] features. The insert_start_token and RoPE metadata are preserved.

Vision runs in the same mesh as the decoder, so this is a single self.vision_model(pixels) call.
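The encode-if-missing behaviour described above can be sketched with a frozen dataclass and `dataclasses.replace`. This is an illustrative stand-in, not the real ImageForInsert: the class `ToyImageForInsert`, its `pixels` field, and the helper `encode_images` are hypothetical, and plain lists take the place of tensors.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyImageForInsert:
    """Hypothetical stand-in for ImageForInsert; field names are assumptions."""

    pixels: Optional[list] = None          # raw [N, 3, H, W] pixels
    image_features: Optional[list] = None  # encoded [N, P, D] features
    insert_start_token: int = 0            # metadata that must survive encoding


def encode_images(
    images: Optional[List[ToyImageForInsert]],
    vision_model: Callable[[list], list],
) -> Optional[List[ToyImageForInsert]]:
    if images is None:
        return None
    encoded = []
    for image in images:
        if image.image_features is not None:
            encoded.append(image)  # already encoded, pass through unchanged
        else:
            features = vision_model(image.pixels)
            # replace() builds a new object and keeps every other field.
            encoded.append(replace(image, image_features=features))
    return encoded


calls = []


def fake_vision_model(pixels: list) -> list:
    calls.append(pixels)  # record each call so skipped images are visible
    return ["features-for", pixels]


raw = ToyImageForInsert(pixels=["img0"], insert_start_token=7)
done = ToyImageForInsert(image_features=["cached"], insert_start_token=9)
out = encode_images([raw, done], fake_vision_model)
```

Returning new objects rather than mutating the inputs keeps the caller's list untouched, which matches the "returns a new ImageForInsert" wording above.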

forward_head(input_ids: torch.Tensor, images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None, position_ids: Optional[torch.Tensor] = None, **kwargs) → torch.Tensor

Compute the fused vision-text input embedding.

Delegates to ImageInsertEmbedding to compute the word embedding + align_projector + insert_features scatter. Returns the fused [S, B, H] embedding ready for the decoder.
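The tensor shapes involved can be traced with a small PyTorch sketch. It uses toy sizes and a fixed insert position, and it assumes a [B, S] input layout that is transposed to sequence-first [S, B, H] at the end. The module objects below are plain `nn.Embedding` and `nn.Linear` stand-ins, not the real ImageInsertEmbedding.

```python
import torch
import torch.nn as nn

VOCAB, HIDDEN, ENC_DIM, PATCHES = 32, 8, 12, 4  # toy sizes, not the real config

word_embedding = nn.Embedding(VOCAB, HIDDEN)
align_projector = nn.Linear(ENC_DIM, HIDDEN)  # encoder.output_dim -> hidden_size

input_ids = torch.randint(0, VOCAB, (2, 10))       # [B, S]
image_features = torch.randn(1, PATCHES, ENC_DIM)  # [N, P, D_enc], one image
batch_index, start = 0, 3  # assumed: image lands in sample 0, position 3

with torch.no_grad():
    embeds = word_embedding(input_ids)             # [B, S, H]
    projected = align_projector(image_features)    # [N, P, H]
    # Overwrite P consecutive rows of the text embedding with image rows.
    embeds[batch_index, start : start + PATCHES] = projected[0]
    fused = embeds.transpose(0, 1).contiguous()    # [S, B, H]
```

The second sample in the batch has no image, so its rows stay pure word embeddings.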

forward(input_ids: torch.Tensor, images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None, cu_seqlens: Optional[torch.Tensor] = None, position_ids: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, loss_mask: Optional[torch.Tensor] = None, max_seq_len: Optional[torch.Tensor] = None, packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None, inference_params: Optional[megatron.core.InferenceParams] = None, extra_block_kwargs: Optional[dict] = None, inference_context: object | None = None, runtime_gather_output: bool | None = None, **kwargs) → torch.Tensor

Step3.7 forward.