bridge.models.stepfun.modelling_step37.vision_model
Step3.7 vision tower (Perception-Encoder G/14 + downsamplers).
Module names mirror vision_model.* in the HF Step37Model
checkpoint so safetensors weights can be loaded by direct AutoMapping:
vision_model.conv1.weight
vision_model.ln_pre.{weight,bias}
vision_model.positional_embedding
vision_model.transformer.resblocks.{N}.attn.{in_proj_weight,in_proj_bias,out_proj.{weight,bias}}
vision_model.transformer.resblocks.{N}.ln_{1,2}.{weight,bias}
vision_model.transformer.resblocks.{N}.ls_{1,2}.gamma
vision_model.transformer.resblocks.{N}.mlp.{c_fc,c_proj}.{weight,bias}
vision_model.vit_downsampler{1,2}.{weight,bias}
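As a rough illustration of the naming scheme above, the following sketch expands the brace patterns into concrete key names. The helper `expected_vision_keys` is hypothetical and not part of the bridge package. The block count of 47 is taken from the `forward` description on this page, and optional parameters (for example a CLS embedding or `ln_post`) are left out because the list above does not name them.

```python
def expected_vision_keys(num_blocks: int, prefix: str = "vision_model") -> list[str]:
    """Expand the documented name patterns into concrete keys (hypothetical helper)."""
    keys = [
        f"{prefix}.conv1.weight",
        f"{prefix}.ln_pre.weight",
        f"{prefix}.ln_pre.bias",
        f"{prefix}.positional_embedding",
    ]
    for n in range(num_blocks):
        block = f"{prefix}.transformer.resblocks.{n}"
        # Fused QKV projection plus the output projection.
        keys += [
            f"{block}.attn.in_proj_weight",
            f"{block}.attn.in_proj_bias",
            f"{block}.attn.out_proj.weight",
            f"{block}.attn.out_proj.bias",
        ]
        # Two layer norms and two layer-scale gammas per block.
        for i in (1, 2):
            keys += [f"{block}.ln_{i}.weight", f"{block}.ln_{i}.bias"]
            keys.append(f"{block}.ls_{i}.gamma")
        # MLP: c_fc followed by c_proj.
        for layer in ("c_fc", "c_proj"):
            keys += [f"{block}.mlp.{layer}.weight", f"{block}.mlp.{layer}.bias"]
    for i in (1, 2):
        keys += [f"{prefix}.vit_downsampler{i}.weight", f"{prefix}.vit_downsampler{i}.bias"]
    return keys


keys = expected_vision_keys(num_blocks=47)
print(len(keys), keys[0], keys[-1])
```

A list like this can be compared against the key set of a loaded safetensors file to spot missing or unexpected vision parameters.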
Module Contents
Classes
Step37VisionModel | Perception-Encoder G/14 vision tower used by Step3.7.
API
- class bridge.models.stepfun.modelling_step37.vision_model.Step37VisionModel(vision_config)
Bases: torch.nn.Module
Perception-Encoder G/14 vision tower used by Step3.7.
The module layout and parameter names match the HF StepRoboticsVisionEncoder checkpoint, which is what makes the Megatron-Bridge weight loader a direct AutoMapping for every vision parameter. The two vit_downsampler convolutions live on this module (matching the HF safetensors); forward runs the whole PE-G/14 trunk plus both downsamplers and returns [N, P', output_dim] in one call. The final vit_large_projector linear is owned by ImageInsertEmbedding (in image_insert_embedding.py) and is applied during the embedding/fusion step, not in the vision tower.
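To make the ownership split concrete, here is a minimal sketch of partitioning checkpoint key names by the `vision_model.` prefix. It is not the Megatron-Bridge loader. The helper `split_vision_keys` is hypothetical, and the projector key name in the example is an assumption made only for illustration.

```python
def split_vision_keys(checkpoint_keys, prefix="vision_model."):
    """Separate vision-tower keys from everything else (hypothetical helper).

    Returns (local_names, other_keys), where local_names have the prefix
    stripped so they can be compared with the module's own parameter names.
    """
    local_names, other_keys = [], []
    for key in checkpoint_keys:
        if key.startswith(prefix):
            local_names.append(key[len(prefix):])
        else:
            other_keys.append(key)
    return local_names, other_keys


example_keys = [
    "vision_model.conv1.weight",
    "vision_model.vit_downsampler1.weight",
    "vision_model.vit_downsampler2.bias",
    "vit_large_projector.weight",  # assumed key name, for illustration only
]
local_names, other_keys = split_vision_keys(example_keys)
print(local_names)
print(other_keys)
```

Under this split, both downsampler parameters land on the vision tower, while the projector key falls outside it, consistent with the description above.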
Initialization
- sample_abs_posemb(grid_h: int, grid_w: int)
- forward(pixel_values: torch.Tensor) → torch.Tensor
Run the PE-G/14 trunk + both downsamplers in one call.
Steps: conv1 patchify → optional CLS → optional abs-pos-emb → ln_pre → 47×VisionBlock → optional ln_post → drop CLS → reshape to spatial → vit_downsampler1 (3×3, stride 2) → vit_downsampler2 (3×3, stride 2) → flatten + transpose.
- Parameters:
pixel_values – float tensor of shape [B, C, H, W] with H = W = image_size (728 for the released checkpoint).
- Returns:
Tensor of shape [B, P', output_dim] (e.g. [B, 169, 6144] for 728² inputs through the released checkpoint). P' is (Gh/4)*(Gw/4), the spatial grid after two stride-2 downsamplers, and output_dim is vit_downsampler2's output channel count.
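The token count in the example can be checked with a few lines of arithmetic. This sketch assumes a patch size of 14 (from the "G/14" name) and a padding of 1 on each 3×3, stride-2 convolution. The padding value is an assumption, since this page does not state it. The function names are hypothetical.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def vision_output_tokens(image_size: int = 728, patch_size: int = 14) -> int:
    """Number of output tokens P' for a square input (hypothetical helper)."""
    grid = image_size // patch_size  # patch grid side: 728 / 14 = 52
    grid = conv_out(grid)            # after vit_downsampler1: 26
    grid = conv_out(grid)            # after vit_downsampler2: 13
    return grid * grid


print(vision_output_tokens())  # 169
```

With these assumptions a 728×728 input gives a 52×52 patch grid, which the two downsamplers reduce to 13×13 = 169 tokens, matching the `[B, 169, 6144]` example.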