nemo_automodel.components.models.step3p7.vision_encoder


Module Contents

Classes

| Name | Description |
| --- | --- |
| EncoderLayerScale | Per-channel residual scaling used when ls_init_value is set. |
| EncoderMLP | Feed-forward network used inside each transformer block. |
| EncoderRope2D | Cacheable 2D rotary positional embedding. |
| EncoderVisionAttention | Multi-head self attention with optional 2D RoPE. |
| EncoderVisionBlock | A single Vision Transformer block (self-attention + MLP). |
| EncoderVisionTransformer | Stack of encoder blocks parameterised by Step35VisionEncoderConfig. |
| StepRoboticsVisionEncoder | Vision encoder built from StepRoboticsVisionEncoderConfig. |

Functions

| Name | Description |
| --- | --- |
| apply_rotary_emb | Apply 2D rotary embeddings to queries / keys. |
| rotate_half | Rotate last dimension halves (used by RoPE). |

API

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderLayerScale(
dim: int,
init_values: float
)

Bases: Module

Per-channel residual scaling used when ls_init_value is set.

gamma
= nn.Parameter(torch.full((dim,), init_values))
nemo_automodel.components.models.step3p7.vision_encoder.EncoderLayerScale.forward(
hidden_states: torch.Tensor
) -> torch.Tensor
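
The per-channel scaling described above can be sketched as follows. This is a minimal illustration and not the library class: `LayerScaleSketch` is a hypothetical name, and the forward pass (an elementwise multiply by `gamma`) is an assumption based on the documented `gamma` parameter.

```python
import torch
from torch import nn


class LayerScaleSketch(nn.Module):
    """Illustrative stand-in for EncoderLayerScale (hypothetical, not the library class)."""

    def __init__(self, dim: int, init_values: float) -> None:
        super().__init__()
        # One learnable scale per channel, all initialised to init_values.
        self.gamma = nn.Parameter(torch.full((dim,), init_values))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Assumed behaviour: gamma broadcasts over the last (channel) dimension.
        return hidden_states * self.gamma


layer_scale = LayerScaleSketch(dim=4, init_values=1e-5)
scaled = layer_scale(torch.ones(2, 3, 4))  # shape (batch, tokens, channels)
```

A small `init_values` keeps each residual branch close to zero at the start of training, which is the usual motivation for this kind of scaling.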
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderMLP(
hidden_size: int,
intermediate_size: int,
hidden_act: str
)

Bases: Module

Feed-forward network used inside each transformer block.

act_fn
= ACT2FN[hidden_act]
c_fc
c_proj
nemo_automodel.components.models.step3p7.vision_encoder.EncoderMLP.forward(
hidden_states: torch.Tensor
) -> torch.Tensor
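
A feed-forward network with the documented `c_fc`, `act_fn`, and `c_proj` members might look like the sketch below. `MLPSketch` is a hypothetical name, the layer order is assumed from the attribute names, and `torch.nn.functional.gelu` stands in for the `ACT2FN[hidden_act]` lookup.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MLPSketch(nn.Module):
    """Illustrative feed-forward block (hypothetical, not the library class)."""

    def __init__(self, hidden_size: int, intermediate_size: int) -> None:
        super().__init__()
        self.c_fc = nn.Linear(hidden_size, intermediate_size)    # expand
        self.c_proj = nn.Linear(intermediate_size, hidden_size)  # project back
        self.act_fn = F.gelu  # stand-in for ACT2FN[hidden_act]

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Assumed order: expand, apply the activation, project back.
        return self.c_proj(self.act_fn(self.c_fc(hidden_states)))


mlp = MLPSketch(hidden_size=8, intermediate_size=32)
mlp_out = mlp(torch.randn(2, 5, 8))  # output keeps the input shape
```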
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D(
dim: int,
max_grid_height: int,
max_grid_width: int,
use_cls_token: bool = False,
theta: typing.Union[int, float] = 10000,
max_freq: int = 10,
num_freqs: int = 1,
theta_rescale_factor: float = 1.0
)

Bases: Module

Cacheable 2D rotary positional embedding.

theta
= theta * theta_rescale_factor ** (dim / (dim - 2))
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D._compute_2d_freqs() -> torch.Tensor
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D._compute_freqs(
t: torch.Tensor,
inv_freq: torch.Tensor
)
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D._compute_inv_freq(
base: typing.Union[int, float],
dim: int
) -> torch.Tensor
nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D.forward(
q: torch.Tensor,
k: torch.Tensor,
grid_hw: tuple[int, int]
)
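
As a worked example of the `theta` expression documented for this class, the sketch below evaluates the rescaling for two settings. The helper names are hypothetical, and the inverse-frequency formula is the standard RoPE schedule, assumed here rather than taken from `_compute_inv_freq`.

```python
def rescaled_theta(theta: float, theta_rescale_factor: float, dim: int) -> float:
    # Mirrors the documented expression; ** binds tighter than *.
    return theta * theta_rescale_factor ** (dim / (dim - 2))


def inv_freq_sketch(base: float, dim: int) -> list[float]:
    # Standard RoPE schedule (an assumption): one frequency per channel pair.
    return [1.0 / base ** (i / dim) for i in range(0, dim, 2)]


default_theta = rescaled_theta(10000, 1.0, 64)    # factor 1.0 leaves theta unchanged
stretched_theta = rescaled_theta(10000, 2.0, 64)  # slightly more than 2x the base
inv_freq = inv_freq_sketch(default_theta, 8)      # 4 values, decreasing from 1.0
```

With the default `theta_rescale_factor` of 1.0 the base is unchanged, so the rescaling only matters when a larger factor is configured.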
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionAttention(
hidden_size: int,
num_heads: int,
max_grid_height: int,
max_grid_width: int,
use_cls_token: bool = False,
use_rope2d: bool = True,
rope_theta: typing.Union[int, float] = 10000,
rope_max_freq: int = 10,
rope_num_freqs: int = 1,
rope_theta_rescale_factor: float = 1.0,
rope_freqs_for: typing.Literal['lang', 'pixel', 'constant'] = 'lang'
)

Bases: Module

Multi-head self attention with optional 2D RoPE.

head_dim
= hidden_size // num_heads
in_proj_bias
= nn.Parameter(torch.zeros(hidden_size * 3))
in_proj_weight
out_proj
= nn.Linear(hidden_size, hidden_size, bias=True)
scale
= self.head_dim ** -0.5
nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionAttention.forward(
hidden_states: torch.Tensor,
grid_hw: tuple[int, int]
) -> torch.Tensor
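
The `head_dim`, `scale`, and fused `in_proj_*` attributes suggest a single QKV projection followed by scaled dot-product attention. The sketch below illustrates that pattern under stated assumptions: the sizes are hypothetical, the weight layout `(3 * hidden_size, hidden_size)` is assumed, and the 2D RoPE step and output projection are omitted.

```python
import torch
import torch.nn.functional as F

hidden_size, num_heads = 64, 4        # hypothetical sizes
head_dim = hidden_size // num_heads   # 16
scale = head_dim ** -0.5              # 0.25

# Fused QKV parameters; the (3 * hidden_size, hidden_size) layout is an assumption.
in_proj_weight = torch.randn(hidden_size * 3, hidden_size) * 0.02
in_proj_bias = torch.zeros(hidden_size * 3)

x = torch.randn(2, 10, hidden_size)   # (batch, tokens, hidden)
q, k, v = F.linear(x, in_proj_weight, in_proj_bias).chunk(3, dim=-1)


def split_heads(t: torch.Tensor) -> torch.Tensor:
    # (batch, tokens, hidden) -> (batch, heads, tokens, head_dim)
    b, n, _ = t.shape
    return t.reshape(b, n, num_heads, head_dim).transpose(1, 2)


q, k, v = split_heads(q), split_heads(k), split_heads(v)
attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
context = (attn @ v).transpose(1, 2).reshape(2, 10, hidden_size)
```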
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionBlock(
hidden_size: int,
num_heads: int,
mlp_ratio: float,
hidden_act: str,
layer_norm_eps: float,
ls_init_value: typing.Optional[float] = None,
max_grid_height: typing.Optional[int] = None,
max_grid_width: typing.Optional[int] = None,
use_cls_token: bool = False,
use_rope2d: bool = True,
rope_kwargs: typing.Optional[dict] = None
)

Bases: Module

A single Vision Transformer block (self-attention + MLP).

attn
ln_1
= nn.LayerNorm(hidden_size, eps=layer_norm_eps)
ln_2
= nn.LayerNorm(hidden_size, eps=layer_norm_eps)
ls_1
= EncoderLayerScale(hidden_size, ls_init_value)
ls_2
= EncoderLayerScale(hidden_size, ls_init_value)
mlp
= EncoderMLP(hidden_size, intermediate, hidden_act)
nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionBlock.forward(
hidden_states: torch.Tensor,
grid_hw: tuple[int, int]
) -> torch.Tensor
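
The attribute names (`ln_1`, `ls_1`, `ln_2`, `ls_2`) are consistent with a pre-norm residual layout, sketched below. This wiring is an assumption, not confirmed by the source: `BlockSketch` is hypothetical, a plain linear layer stands in for self-attention, and `int(hidden_size * mlp_ratio)` is a guess at how `intermediate` is derived.

```python
from typing import Optional

import torch
from torch import nn


class BlockSketch(nn.Module):
    """Illustrative pre-norm block (hypothetical, not the library class)."""

    def __init__(self, hidden_size: int, mlp_ratio: float,
                 layer_norm_eps: float = 1e-5,
                 ls_init_value: Optional[float] = None) -> None:
        super().__init__()
        intermediate = int(hidden_size * mlp_ratio)  # assumed derivation
        self.ln_1 = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.ln_2 = nn.LayerNorm(hidden_size, eps=layer_norm_eps)
        self.attn = nn.Linear(hidden_size, hidden_size)  # placeholder for self-attention
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, intermediate),
            nn.GELU(),
            nn.Linear(intermediate, hidden_size),
        )
        # Layer scale is only created when ls_init_value is set.
        self.gamma_1 = self._make_gamma(hidden_size, ls_init_value)
        self.gamma_2 = self._make_gamma(hidden_size, ls_init_value)

    @staticmethod
    def _make_gamma(dim: int, init: Optional[float]) -> Optional[nn.Parameter]:
        return None if init is None else nn.Parameter(torch.full((dim,), init))

    @staticmethod
    def _scale(x: torch.Tensor, gamma: Optional[nn.Parameter]) -> torch.Tensor:
        return x if gamma is None else x * gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self._scale(self.attn(self.ln_1(x)), self.gamma_1)
        x = x + self._scale(self.mlp(self.ln_2(x)), self.gamma_2)
        return x


tokens = torch.randn(2, 6, 16)
block_out = BlockSketch(hidden_size=16, mlp_ratio=4.0)(tokens)
identity_out = BlockSketch(hidden_size=16, mlp_ratio=4.0, ls_init_value=0.0)(tokens)
```

With `ls_init_value=0.0` both residual branches are multiplied by zero, so the sketch returns its input unchanged.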
class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionTransformer(
embed_dim: int,
depth: int,
num_heads: int,
mlp_ratio: float,
hidden_act: str,
layer_norm_eps: float,
ls_init_value: typing.Optional[float] = None,
max_grid_height: typing.Optional[int] = None,
max_grid_width: typing.Optional[int] = None,
use_cls_token: bool = False,
use_rope2d: bool = True,
rope_kwargs: typing.Optional[dict] = None
)

Bases: Module

Stack of encoder blocks parameterised by Step35VisionEncoderConfig.

resblocks
nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionTransformer.forward(
hidden_states: torch.Tensor,
grid_hw: tuple[int, int]
) -> torch.Tensor
class nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder(
config: nemo_automodel.components.models.step3p7.configuration_step3p7.StepRoboticsVisionEncoderConfig
)

Bases: Module

Vision encoder built from StepRoboticsVisionEncoderConfig.

The encoder performs patch embedding followed by a stack of transformer blocks. Only the config fields defined in StepRoboticsVisionEncoderConfig (and StepRoboticVLConfig.vision_config) are expected.

base_grid
= (grid_size, grid_size)
class_embedding
conv1
hidden_act
= config.hidden_act
hidden_size
= config.width
image_size
= config.image_size
layer_norm_eps
= config.layer_norm_eps
ln_post
ln_pre
ls_init_value
= getattr(config, 'ls_init_value', None)
mlp_ratio
= getattr(config, 'mlp_ratio', 8960 / 1536)
num_heads
= config.heads
num_hidden_layers
= config.layers
patch_size
= config.patch_size
posemb_grid_size
= self.image_size // self.patch_size
positional_embedding
transformer
use_abs_posemb
= getattr(config, 'use_abs_posemb', True)
use_cls_token
= getattr(config, 'use_cls_token', False)
use_ln_post
= getattr(config, 'use_ln_post', True)
use_ln_pre
= getattr(config, 'use_ln_pre', False)
use_rope2d
= getattr(config, 'use_rope2d', True)
vit_downsampler1
vit_downsampler2
nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder.forward(
pixel_values: torch.Tensor
) -> torch.Tensor

Parameters:

pixel_values
torch.Tensor

Image tensor of shape (B, C, H, W).

layer_idx

Negative indices stop after a given block (e.g., -1 uses all blocks).

strip_cls_token

If True and a CLS token is used, remove it from the output.
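
To illustrate how the documented attributes resolve, the sketch below builds a stand-in config and applies the same `getattr` defaults listed above. The config values are hypothetical, and `SimpleNamespace` replaces the real `StepRoboticsVisionEncoderConfig` purely for illustration.

```python
from types import SimpleNamespace

# Hypothetical config carrying only the fields read without a default.
config = SimpleNamespace(
    width=1536, heads=16, layers=2,
    image_size=224, patch_size=14,
    hidden_act="gelu", layer_norm_eps=1e-5,
)

# Patches per side and in total for a square image.
posemb_grid_size = config.image_size // config.patch_size  # 16
num_patches = posemb_grid_size ** 2                        # 256

# Optional fields fall back to the documented defaults when absent.
use_rope2d = getattr(config, "use_rope2d", True)
use_cls_token = getattr(config, "use_cls_token", False)
use_abs_posemb = getattr(config, "use_abs_posemb", True)
mlp_ratio = getattr(config, "mlp_ratio", 8960 / 1536)
```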

nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder.sample_abs_posemb(
grid_h: int,
grid_w: int
)
nemo_automodel.components.models.step3p7.vision_encoder.apply_rotary_emb(
freqs: torch.Tensor,
t: torch.Tensor,
start_index: int = 0,
scale: float = 1.0,
seq_dim: int = -2
) -> torch.Tensor

Apply 2D rotary embeddings to queries / keys.

nemo_automodel.components.models.step3p7.vision_encoder.rotate_half(
x: torch.Tensor
) -> torch.Tensor

Rotate last dimension halves (used by RoPE).
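
One common way to implement this pair of helpers is sketched below, assuming the split-in-half convention where channel `i` is paired with channel `i + d/2`. The function names are hypothetical, and the library's own `rotate_half` may pair adjacent channels instead, so treat this as an illustration of the technique rather than the module's exact behaviour.

```python
import torch


def rotate_half_sketch(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension in two and map (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotation_sketch(angles: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    # angles has d/2 entries; duplicate so both channels of a pair share one angle.
    angles = torch.cat((angles, angles), dim=-1)
    return t * angles.cos() + rotate_half_sketch(t) * angles.sin()


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
rotated = rotate_half_sketch(x)  # tensor([-3., -4., 1., 2.])
turned = apply_rotation_sketch(torch.tensor([0.3, 1.2]), x)
```

Because each channel pair undergoes a plain 2D rotation, the vector norm is preserved, and a zero angle returns the input unchanged.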