nemo_automodel.components.models.step3p7.vision_encoder
Module Contents
Classes
Functions
API
Bases: Module
Per-channel residual scaling used when ls_init_value is set.
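As a rough illustration of per-channel residual scaling, the sketch below multiplies a branch output by a learnable per-channel vector before adding it back to the residual stream. It uses plain Python lists to stay dependency-free, whereas the real module operates on tensors. The helper names and the initial value `0.5` are hypothetical; only `ls_init_value` comes from the description above.

```python
def layer_scale(x, gamma):
    """Scale every channel of x (rows = tokens, columns = channels) by gamma."""
    return [[v * g for v, g in zip(row, gamma)] for row in x]


def residual_with_scale(x, branch_out, gamma):
    """Add the scaled branch output back onto the residual stream x."""
    scaled = layer_scale(branch_out, gamma)
    return [[a + b for a, b in zip(rx, rs)] for rx, rs in zip(x, scaled)]


ls_init_value = 0.5  # hypothetical value, chosen so the arithmetic is exact
channels = 3
gamma = [ls_init_value] * channels  # one scale per channel, all starting equal

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]          # two tokens, three channels
branch = [[2.0, 4.0, 6.0], [8.0, 10.0, 12.0]]   # stand-in for attention or MLP output
out = residual_with_scale(x, branch, gamma)
```

In practice such scales are often initialised to a small value so that each block starts close to the identity mapping. Whether this module does so depends on the configured `ls_init_value`.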
Bases: Module
Feed-forward network used inside each transformer block.
Bases: Module
Cacheable 2D rotary positional embedding.
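One way to picture a cacheable 2D rotary embedding is as a table of rotation angles indexed by patch position, computed once per grid size and reused afterwards. The sketch below is an assumption-laden illustration, not the module's code: the function name, the `theta` base of 10000, the use of `functools.lru_cache`, and the layout (first half of the angles for the row index, second half for the column index) are all hypothetical choices.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def rope_2d_angles(grid_h, grid_w, head_dim, theta=10000.0):
    """Return one tuple of rotation angles per patch, in row-major order.

    Each patch gets head_dim // 2 angles: the first half encodes its row,
    the second half its column. Results are cached per argument tuple.
    """
    if head_dim % 4 != 0:
        raise ValueError("head_dim must be divisible by 4 for this 2D layout")
    n_freq = head_dim // 4  # number of frequencies per spatial axis
    inv_freq = [theta ** (-i / n_freq) for i in range(n_freq)]
    table = []
    for row in range(grid_h):
        for col in range(grid_w):
            angles = [row * f for f in inv_freq] + [col * f for f in inv_freq]
            table.append(tuple(angles))
    return tuple(table)


angles = rope_2d_angles(2, 3, 8)        # 2 x 3 patch grid, head_dim of 8
angles_again = rope_2d_angles(2, 3, 8)  # served from the cache
```

Caching by grid size means images of the same resolution pay the trigonometric setup cost only once.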
Bases: Module
Multi-head self-attention with optional 2D RoPE.
Bases: Module
A single Vision Transformer block (self-attention + MLP).
Bases: Module
Stack of encoder blocks parameterised by Step35VisionEncoderConfig.
Bases: Module
Vision encoder built from StepRoboticsVisionEncoderConfig.
The encoder performs patch embedding followed by a stack of transformer blocks. Only the config fields defined in StepRoboticsVisionEncoderConfig (and StepRoboticVLConfig.vision_config) are expected.
Parameters:
Image tensor of shape (B, C, H, W).
Negative indices count back from the last block and stop after that block (e.g., -1 uses all blocks).
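The negative-index convention can be read like Python sequence indexing. The sketch below shows one plausible interpretation, in which the index names the last block to execute. The helper `blocks_to_run` and the parameter name `layer_idx` are hypothetical, since the actual parameter name is not shown here.

```python
def blocks_to_run(num_blocks, layer_idx=-1):
    """Return how many blocks execute when stopping after block layer_idx."""
    if not -num_blocks <= layer_idx < num_blocks:
        raise IndexError(f"layer_idx {layer_idx} out of range for {num_blocks} blocks")
    # Python's modulo maps negative indices onto 0 .. num_blocks - 1.
    return layer_idx % num_blocks + 1


all_blocks = blocks_to_run(12, -1)   # stop after the last block
penultimate = blocks_to_run(12, -2)  # skip the final block
first_only = blocks_to_run(12, 0)    # stop after the first block
```

Under this reading, -2 yields the features produced by the second-to-last block, a common choice when feeding vision features into a language model.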
If True and a CLS token is used, remove it from the output.
Apply 2D rotary embeddings to queries / keys.
Rotate last dimension halves (used by RoPE).
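The two functions above fit together in the usual half-split RoPE formulation: a vector is combined with its half-rotated copy using cosine and sine weights. The sketch below works on a single vector stored as a plain list, whereas the real functions operate on batched tensors. The names `rotate_half` and `apply_rotary`, and the way angles are passed in, are assumptions for illustration.

```python
import math


def rotate_half(x):
    """Split x into halves [x1, x2] and return [-x2, x1]."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rotary(vec, angles):
    """Rotate vec by one angle per (i, i + half) pair: vec*cos + rotate_half(vec)*sin."""
    cos = [math.cos(a) for a in angles] * 2  # same angle for both members of a pair
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(vec)
    return [v * c + r * s for v, c, r, s in zip(vec, cos, rot, sin)]


q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rotary(q, [0.3, 1.1])        # one angle per pair (q[0], q[2]) and (q[1], q[3])
q_unchanged = apply_rotary(q, [0.0, 0.0])  # zero angles leave the vector as-is
```

Because each pair undergoes a plane rotation, the vector norm is preserved. Applying the same position-dependent rotation to queries and keys makes their dot product depend on relative position.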