bridge.models.stepfun.modelling_step37.model
Step3.7 multimodal model orchestrator.
Combines a vision tower, a vision-text fusion step, and a Step-3.5 text decoder. The forward path is:
- Input: forward(input_ids, images: list[ImageForInsert], cu_seqlens, position_ids, attention_mask, labels, loss_mask, packed_seq_params, ...).
- Vision encode: _encode_images_for_insert(images) runs the PE-G/14 trunk + both downsamplers per ImageForInsert, populating image_features ([N, 169, encoder.output_dim]).
- Vision-text fusion: ImageInsertEmbedding (owns align_projector: encoder.output_dim -> hidden_size) projects the features and scatter-inserts them one position after each <im_start> (+1) via its insert_features algorithm.
- The combined embedding is handed to the standard Step-3.5 text decoder via Step37GPTModel.forward(decoder_input=...).
The model consumes list[ImageForInsert] directly; there are no
pixel_values / image_grid_thw Qwen-VL-style kwargs.
Module Contents
Classes
Step37Model | Step3.7 multimodal model.
Data
__all__
API
- class bridge.models.stepfun.modelling_step37.model.Step37Model(
- language_transformer_config: megatron.bridge.models.stepfun.modelling_step37.transformer_config.Step37TransformerConfig,
- language_transformer_layer_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- vision_transformer_config,
- parallel_output: bool = True,
- pre_process: bool = True,
- post_process: bool = True,
- add_encoder: bool = True,
- add_decoder: bool = True,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- mtp_block_spec: Optional[megatron.core.transformer.spec_utils.ModuleSpec] = None,
- vp_stage: Optional[int] = None,
Bases: megatron.core.transformer.MegatronModule
Step3.7 multimodal model.
- Parameters:
language_transformer_config – Step3.7 TransformerConfig carrying both the Step-3.5 text-decoder fields and the multimodal fields (vision_config, image_token_id, understand_projector_stride, projector_bias).
language_transformer_layer_spec – Per-layer ModuleSpec for the text decoder; see modelling_step37/transformer_block.py.
vision_transformer_config – HF StepRoboticsVisionEncoderConfig describing the PE-G/14 trunk.
parallel_output – forwarded to Step37GPTModel.
pre_process / post_process – standard PP-stage flags.
add_encoder / add_decoder – PP-stage gating for the vision and language modules. The vision tower is built only when both pre_process and add_encoder are true.
pg_collection – process-group bundle (uses MPU defaults if None).
mtp_block_spec – optional MTP block spec forwarded to GPTModel.
vp_stage – optional virtual-PP stage index.
Initialization
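The gating rule for the vision tower can be sketched as a small predicate. This is only an illustration of the documented condition (both pre_process and add_encoder true); the helper name builds_vision_tower and the three-stage layout are hypothetical and not part of the module.

```python
def builds_vision_tower(pre_process: bool, add_encoder: bool) -> bool:
    """Documented rule: the vision tower exists only when both flags are true."""
    return pre_process and add_encoder


# Hypothetical 3-stage pipeline layout (assumption, for illustration only):
# only the first stage embeds inputs, so only it can own the vision tower.
stages = [
    {"pre_process": True, "add_encoder": True},
    {"pre_process": False, "add_encoder": True},
    {"pre_process": False, "add_encoder": True},
]

# One boolean per stage: does this stage construct the vision tower?
has_vision = [builds_vision_tower(s["pre_process"], s["add_encoder"]) for s in stages]
```

Under this layout only the first stage would construct the vision tower, even though add_encoder is set on every stage.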
- property decoder
- set_input_tensor(input_tensor) -> None
Standard PP plumbing: the input tensor is stored as encoder_hidden_state on pre_process ranks and is otherwise forwarded to the language model.
- freeze(
- freeze_language_model: bool,
- freeze_vision_model: bool,
- freeze_vision_projection: bool,
Freeze any combination of the language tower / vision tower / projector for fine-tuning scenarios.
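As a rough sketch of how the three flags might select parameter groups, the toy below uses plain Python stand-ins instead of real tensors. The Param and ToyModel classes, their attribute names, and the parameter names are all assumptions made for illustration; the actual method operates on the model's own submodules.

```python
from dataclasses import dataclass, field


@dataclass
class Param:
    """Stand-in for a trainable tensor; only the requires_grad flag matters here."""
    name: str
    requires_grad: bool = True


@dataclass
class ToyModel:
    # Hypothetical grouping of parameters; real attribute names may differ.
    language_model: list = field(default_factory=lambda: [Param("decoder.w")])
    vision_model: list = field(default_factory=lambda: [Param("vision.w")])
    vision_projection: list = field(default_factory=lambda: [Param("align_projector.w")])


def freeze(model, freeze_language_model, freeze_vision_model, freeze_vision_projection):
    """Turn off gradients for every selected parameter group."""
    selected = []
    if freeze_language_model:
        selected.append(model.language_model)
    if freeze_vision_model:
        selected.append(model.vision_model)
    if freeze_vision_projection:
        selected.append(model.vision_projection)
    for group in selected:
        for param in group:
            param.requires_grad = False


def trainable(model):
    """Names of parameters that still receive gradients."""
    groups = (model.language_model, model.vision_model, model.vision_projection)
    return sorted(p.name for g in groups for p in g if p.requires_grad)


# Projector-only fine-tuning: freeze both towers, keep the projector trainable.
model = ToyModel()
freeze(model, freeze_language_model=True, freeze_vision_model=True,
       freeze_vision_projection=False)
remaining = trainable(model)
```

With both towers frozen, only the projector stand-in remains trainable, which corresponds to a projector-alignment fine-tuning stage.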
- _encode_images_for_insert(
- images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]],
Encode raw image pixels into vision features.
For each ImageForInsert in images, runs the vision tower on its raw [N, 3, H, W] pixels (if image_features isn't already populated) and returns a new ImageForInsert carrying the encoded [N, P, encoder.output_dim] features. The insert_start_token + RoPE metadata is preserved.
Vision runs in the same mesh as the decoder, so this is a single self.vision_model(pixels) call.
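The encode-only-if-missing behaviour described above can be sketched with a frozen dataclass. ImageStub, its field names, and fake_vision_model are hypothetical stand-ins (the real ImageForInsert carries tensors and RoPE metadata), and returning an empty list for a missing images argument is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class ImageStub:
    """Hypothetical stand-in for ImageForInsert."""
    pixels: Optional[list] = None
    image_features: Optional[list] = None
    insert_start_token: int = 0


def encode_images(images: Optional[list], vision_model: Callable) -> list:
    """Return new items with image_features filled in where they were missing."""
    if not images:
        return []  # assumption: nothing to encode
    encoded = []
    for img in images:
        if img.image_features is None:
            # replace() copies every other field, so metadata is preserved.
            img = replace(img, image_features=vision_model(img.pixels))
        encoded.append(img)
    return encoded


calls = []


def fake_vision_model(pixels):
    """Toy encoder: records the call and emits one 2-dim feature row per input row."""
    calls.append(pixels)
    return [[0.0, 0.0] for _ in pixels]


raw = ImageStub(pixels=[[1, 2], [3, 4]], insert_start_token=7)
pre_encoded = ImageStub(image_features=[[9.0, 9.0]], insert_start_token=7)
result = encode_images([raw, pre_encoded], fake_vision_model)
```

The input items are not mutated: the raw item is replaced by a new object, while the pre-encoded one passes through untouched.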
- forward_head(
- input_ids: torch.Tensor,
- images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None,
- position_ids: Optional[torch.Tensor] = None,
- **kwargs,
Compute the fused vision-text input embedding.
Delegates to ImageInsertEmbedding, which computes the word embedding, applies align_projector, and performs the insert_features scatter. Returns the fused [S, B, H] embedding ready for the decoder.
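A minimal sketch of the scatter step, using plain lists in place of tensors, may help make the "+1" placement concrete. The token ids, the helper name insert_features_sketch, and the 3-patch span (169 in the documented model) are assumptions; the projection through align_projector is omitted, so the features are taken as already projected.

```python
IM_START, IM_PATCH = 1001, 1002  # hypothetical token ids
PATCHES = 3                      # 169 in the documented model; shortened here


def insert_features_sketch(token_ids, embeddings, image_features, start_token=IM_START):
    """Overwrite the placeholder span after each start token with image features."""
    fused = [row[:] for row in embeddings]  # copy so the input stays untouched
    starts = [i for i, tok in enumerate(token_ids) if tok == start_token]
    if len(starts) != len(image_features):
        raise ValueError("one feature block is expected per start token")
    for start, feats in zip(starts, image_features):
        for offset, vec in enumerate(feats):
            # Features begin one position after the start token (the "+1").
            fused[start + 1 + offset] = list(vec)
    return fused


token_ids = [5, IM_START] + [IM_PATCH] * PATCHES + [6]
text_embeddings = [[float(tok)] for tok in token_ids]  # toy 1-dim embeddings
features = [[[0.1], [0.2], [0.3]]]                      # one image, 3 patches
fused = insert_features_sketch(token_ids, text_embeddings, features)
```

The start token keeps its own text embedding; only the placeholder positions that follow it are replaced.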
- forward(
- input_ids: torch.Tensor,
- images: Optional[list[megatron.bridge.models.stepfun.modelling_step37.image_insert_embedding.ImageForInsert]] = None,
- cu_seqlens: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.Tensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- labels: Optional[torch.Tensor] = None,
- loss_mask: Optional[torch.Tensor] = None,
- max_seq_len: Optional[torch.Tensor] = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- inference_params: Optional[megatron.core.InferenceParams] = None,
- extra_block_kwargs: Optional[dict] = None,
- inference_context: object | None = None,
- runtime_gather_output: bool | None = None,
- **kwargs,
Step3.7 forward.
- Parameters:
input_ids – [1, T] packed token ids (the flickr8k pipeline always feeds B=1 because per-pack sub-sequences are demarcated by cu_seqlens).
images – pre-encoded or raw list[ImageForInsert]. Each item's insert_start_token points at the placeholder token id (e.g. <im_start>) used by insert_features to locate the 169-token <im_patch> span.
cu_seqlens – [B_sub+1] int32 sub-sequence boundary array inside the packed row.
position_ids – optional per-sub-seq position ids (None lets the decoder layer's RoPE module compute them internally).
packed_seq_params – pre-built FlashAttn varlen PackedSeqParams; the flickr8k forward step builds these from cu_seqlens.
attention_mask / labels / loss_mask / max_seq_len – standard multimodal SFT batch fields.
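The relationship between a packed row and cu_seqlens can be illustrated with plain Python. The helper names are hypothetical, and restarting position ids at zero for every sub-sequence is an assumption about what "per-sub-seq position ids" means; the real pipeline holds these values in int32 tensors.

```python
from itertools import accumulate


def build_cu_seqlens(lengths):
    """Cumulative boundaries: B_sub lengths produce B_sub + 1 entries, starting at 0."""
    return [0, *accumulate(lengths)]


def per_subseq_position_ids(cu_seqlens):
    """Position ids that restart at 0 inside every sub-sequence (assumption)."""
    ids = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        ids.extend(range(end - start))
    return ids


lengths = [4, 2, 3]                       # three sub-sequences in one packed row
cu_seqlens = build_cu_seqlens(lengths)    # boundaries inside the [1, T] row
position_ids = per_subseq_position_ids(cu_seqlens)
total_tokens = cu_seqlens[-1]             # T, the packed row length
max_seq_len = max(lengths)                # longest sub-sequence in the pack
```

Here the last entry of cu_seqlens equals the packed length T, and sub-sequence i occupies the half-open range from cu_seqlens[i] to cu_seqlens[i+1].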
- bridge.models.stepfun.modelling_step37.model.__all__
['Step37Model']