`bridge.models.stepfun.modelling_step37.vision_model`#

Step3.7 vision tower (Perception-Encoder G/14 + downsamplers).

Module names mirror vision_model.* in the HF Step37Model checkpoint so safetensors weights can be loaded by direct AutoMapping:

vision_model.conv1.weight
vision_model.ln_pre.{weight,bias}
vision_model.positional_embedding
vision_model.transformer.resblocks.{N}.attn.{in_proj_weight,in_proj_bias,
                                              out_proj.{weight,bias}}
vision_model.transformer.resblocks.{N}.ln_{1,2}.{weight,bias}
vision_model.transformer.resblocks.{N}.ls_{1,2}.gamma
vision_model.transformer.resblocks.{N}.mlp.{c_fc,c_proj}.{weight,bias}
vision_model.vit_downsampler{1,2}.{weight,bias}
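
The brace patterns above can be expanded into concrete names, which is handy for checking a checkpoint's key set before loading. The following is a minimal sketch, not part of the module: `expected_vision_keys` is a hypothetical helper, the block count is passed in by the caller, and only the names listed above are covered (parameters not listed there, such as any belonging to the optional CLS token or ln_post, are left out).

```python
from itertools import product


def expected_vision_keys(num_blocks: int) -> list[str]:
    """Expand the documented name patterns into concrete parameter names.

    Hypothetical helper for illustration; it covers only the patterns
    listed in the module docstring.
    """
    prefix = "vision_model"
    wb = ("weight", "bias")

    # Stem parameters that appear once per model.
    keys = [
        f"{prefix}.conv1.weight",
        f"{prefix}.ln_pre.weight",
        f"{prefix}.ln_pre.bias",
        f"{prefix}.positional_embedding",
    ]

    # Per-block parameters: attention, layer norms, layer scales, MLP.
    for n in range(num_blocks):
        block = f"{prefix}.transformer.resblocks.{n}"
        keys += [f"{block}.attn.in_proj_weight", f"{block}.attn.in_proj_bias"]
        keys += [f"{block}.attn.out_proj.{p}" for p in wb]
        keys += [f"{block}.ln_{i}.{p}" for i, p in product((1, 2), wb)]
        keys += [f"{block}.ls_{i}.gamma" for i in (1, 2)]
        keys += [f"{block}.mlp.{m}.{p}" for m, p in product(("c_fc", "c_proj"), wb)]

    # The two downsampler convolutions that live on the vision tower.
    keys += [f"{prefix}.vit_downsampler{i}.{p}" for i, p in product((1, 2), wb)]
    return keys
```

Comparing `set(expected_vision_keys(n))` against the `vision_model.*` keys of a loaded state dict shows quickly which listed names are missing.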

Module Contents#

Classes#

Step37VisionModel

Perception-Encoder G/14 vision tower used by Step3.7.

API#

class bridge.models.stepfun.modelling_step37.vision_model.Step37VisionModel(vision_config)#

Bases: torch.nn.Module

Perception-Encoder G/14 vision tower used by Step3.7.

The module layout and parameter names match the HF StepRoboticsVisionEncoder checkpoint, which is what makes the Megatron-Bridge weight loader a direct AutoMapping for every vision parameter. The two vit_downsampler convolutions live on this module (matching the HF safetensors); forward runs the whole PE-G/14 trunk plus both downsamplers and returns [N, P', output_dim] in one call. The final vit_large_projector linear is owned by ImageInsertEmbedding (in image_insert_embedding.py) and is applied during the embedding/fusion step, not in the vision tower.

Initialization

sample_abs_posemb(grid_h: int, grid_w: int)#

forward(pixel_values: torch.Tensor) → torch.Tensor#

Run the PE-G/14 trunk + both downsamplers in one call.

Steps: conv1 patchify → optional CLS → optional abs-pos-emb → ln_pre → 47×VisionBlock → optional ln_post → drop CLS → reshape to spatial → vit_downsampler1 (3×3, stride 2) → vit_downsampler2 (3×3, stride 2) → flatten + transpose.

Parameters: pixel_values – float tensor of shape [B, C, H, W] with H = W = image_size (728 for the released checkpoint).
Returns: Tensor of shape [B, P', output_dim] (e.g. [B, 169, 6144] for 728² inputs through the released checkpoint). P' is (Gh/4)*(Gw/4), the size of the spatial grid after two stride-2 downsamplers, where Gh × Gw is the patch grid produced by conv1; output_dim is vit_downsampler2’s output channel count.
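
The 169-token figure can be reproduced from the steps listed for forward. The sketch below is illustrative only: `conv_out` and `output_tokens` are hypothetical helpers, the patch size of 14 is taken from the "G/14" name with conv1 assumed to be a non-overlapping patchify, and padding=1 on both downsamplers is an assumption (the docstring gives only the 3×3 kernel and stride 2).

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output size along one axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def output_tokens(image_size: int, patch_size: int = 14, padding: int = 1) -> int:
    """Count the tokens P' per image for a square input.

    Assumptions (not confirmed by the docstring): conv1 uses
    kernel == stride == patch_size with no padding, and both
    downsamplers use the given padding.
    """
    # conv1 patchify: image_size -> Gh (== Gw for a square input).
    grid = conv_out(image_size, kernel=patch_size, stride=patch_size, padding=0)

    # vit_downsampler1 and vit_downsampler2: 3x3 kernel, stride 2 each.
    for _ in range(2):
        grid = conv_out(grid, kernel=3, stride=2, padding=padding)

    # Flattening the final grid gives the token count P'.
    return grid * grid


print(output_tokens(728))  # 169
```

Under these assumptions the grid goes 52 → 26 → 13 per side, which matches (Gh/4)*(Gw/4) = 13 × 13. With padding=0 the same function gives 144, so padding=1 is the setting consistent with the documented shape.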
