bridge.models.stepfun.modelling_step37.vision_model

Step3.7 vision tower (Perception-Encoder G/14 + downsamplers).

Module names mirror vision_model.* in the HF Step37Model checkpoint so safetensors weights can be loaded by direct AutoMapping:

vision_model.conv1.weight
vision_model.ln_pre.{weight,bias}
vision_model.positional_embedding
vision_model.transformer.resblocks.{N}.attn.{in_proj_weight,in_proj_bias,
                                              out_proj.{weight,bias}}
vision_model.transformer.resblocks.{N}.ln_{1,2}.{weight,bias}
vision_model.transformer.resblocks.{N}.ls_{1,2}.gamma
vision_model.transformer.resblocks.{N}.mlp.{c_fc,c_proj}.{weight,bias}
vision_model.vit_downsampler{1,2}.{weight,bias}
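
The naming pattern above can be expanded into a concrete key list, which is handy for checking a checkpoint before loading. The following is a minimal sketch, not part of the module: `expected_vision_keys` is a hypothetical helper, it assumes `{N}` is a zero-based block index, and it covers only the names listed above (any optional parameters, such as a CLS embedding or ln_post, are left out).

```python
def expected_vision_keys(num_blocks: int) -> list[str]:
    """Expand the vision_model.* name patterns for `num_blocks` resblocks.

    Hypothetical helper: covers only the patterns listed in this page and
    assumes block indices start at 0.
    """
    prefix = "vision_model"
    keys = [
        f"{prefix}.conv1.weight",
        f"{prefix}.ln_pre.weight",
        f"{prefix}.ln_pre.bias",
        f"{prefix}.positional_embedding",
    ]
    for n in range(num_blocks):
        block = f"{prefix}.transformer.resblocks.{n}"
        # Fused QKV projection plus the output projection.
        keys += [
            f"{block}.attn.in_proj_weight",
            f"{block}.attn.in_proj_bias",
            f"{block}.attn.out_proj.weight",
            f"{block}.attn.out_proj.bias",
        ]
        # Two layer norms and two layer-scale gammas per block.
        for i in (1, 2):
            keys += [
                f"{block}.ln_{i}.weight",
                f"{block}.ln_{i}.bias",
                f"{block}.ls_{i}.gamma",
            ]
        # Two-layer MLP.
        for layer in ("c_fc", "c_proj"):
            keys += [f"{block}.mlp.{layer}.weight", f"{block}.mlp.{layer}.bias"]
    for i in (1, 2):
        keys += [
            f"{prefix}.vit_downsampler{i}.weight",
            f"{prefix}.vit_downsampler{i}.bias",
        ]
    return keys


if __name__ == "__main__":
    # 47 is the block count mentioned in the forward() description.
    keys = expected_vision_keys(47)
    print(len(keys), keys[0], keys[-1])
```

Comparing this list against the keys of a loaded safetensors file would show which names are missing or unexpected before any weights are copied.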

Module Contents

Classes

Step37VisionModel

Perception-Encoder G/14 vision tower used by Step3.7.

API

class bridge.models.stepfun.modelling_step37.vision_model.Step37VisionModel(vision_config)

Bases: torch.nn.Module

Perception-Encoder G/14 vision tower used by Step3.7.

The module layout and parameter names match the HF StepRoboticsVisionEncoder checkpoint, which is what makes the Megatron-Bridge weight loader a direct AutoMapping for every vision parameter. The two vit_downsampler convolutions live on this module (matching the HF safetensors); forward runs the whole PE-G/14 trunk plus both downsamplers and returns [B, P', output_dim] in one call. The final vit_large_projector linear is owned by ImageInsertEmbedding (in image_insert_embedding.py) and is applied during the embedding/fusion step, not in the vision tower.

Initialization

sample_abs_posemb(grid_h: int, grid_w: int)
forward(pixel_values: torch.Tensor) → torch.Tensor

Run the PE-G/14 trunk + both downsamplers in one call.

Steps: conv1 patchify → optional CLS → optional abs-pos-emb → ln_pre → 47×VisionBlock → optional ln_post → drop CLS → reshape to spatial → vit_downsampler1 (3×3, stride 2) → vit_downsampler2 (3×3, stride 2) → flatten + transpose.

Parameters:

pixel_values – float tensor of shape [B, C, H, W] with H = W = image_size (728 for the released checkpoint).

Returns:

Tensor of shape [B, P', output_dim] (e.g. [B, 169, 6144] for 728² inputs through the released checkpoint). P' is (Gh/4)*(Gw/4), the spatial grid after two stride-2 downsamplers, where Gh and Gw are the patch-grid height and width after conv1, and output_dim is vit_downsampler2's output channel count.
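
The token count in the example can be reproduced with the standard convolution output-size formula. This is a minimal sketch under stated assumptions: the patch size of 14 is read off the "G/14" name, and `padding=1` on both downsamplers is an assumption chosen because it reproduces the documented 169 tokens. The helper names `conv_out` and `trace_grid` are hypothetical and not part of the module.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_grid(image_size: int = 728, patch_size: int = 14) -> dict[str, int]:
    """Follow one spatial side through conv1 and both downsamplers.

    Assumes square inputs, non-overlapping patches in conv1, and
    padding=1 on each 3x3 stride-2 downsampler (an assumption, not
    confirmed by the source).
    """
    grid = conv_out(image_size, kernel=patch_size, stride=patch_size, padding=0)
    after_ds1 = conv_out(grid, kernel=3, stride=2, padding=1)
    after_ds2 = conv_out(after_ds1, kernel=3, stride=2, padding=1)
    return {
        "patch_grid": grid,          # Gh = Gw after conv1
        "after_downsampler1": after_ds1,
        "after_downsampler2": after_ds2,
        "tokens": after_ds2 * after_ds2,  # P' after flatten
    }


if __name__ == "__main__":
    print(trace_grid())
```

For the default 728-pixel input the side length goes 52, then 26, then 13, so P' is 13 * 13 = 169, matching (Gh/4)*(Gw/4) with Gh = Gw = 52.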