`nemo_automodel.components.models.step3p7.vision_encoder`#

Module Contents#

Classes#

`EncoderRope2D`	Cacheable 2D rotary positional embedding.
`EncoderLayerScale`	Per-channel residual scaling used when ls_init_value is set.
`EncoderMLP`	Feed-forward network used inside each transformer block.
`EncoderVisionAttention`	Multi-head self attention with optional 2D RoPE.
`EncoderVisionBlock`	A single Vision Transformer block (self-attention + MLP).
`EncoderVisionTransformer`	Stack of encoder blocks parameterised by Step35VisionEncoderConfig.
`StepRoboticsVisionEncoder`	Vision encoder built from StepRoboticsVisionEncoderConfig.

Functions#

`rotate_half`	Rotate last dimension halves (used by RoPE).
`apply_rotary_emb`	Apply 2D rotary embeddings to queries / keys.

API#

nemo_automodel.components.models.step3p7.vision_encoder.rotate_half(x: torch.Tensor) → torch.Tensor#

Rotate last dimension halves (used by RoPE).
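The one-line description leaves the exact layout open, so the following is only a sketch of one common reading. It assumes the split-halves convention, where the last dimension is cut into a first half `x1` and a second half `x2` and the result is `(-x2, x1)`; the actual implementation may instead rotate interleaved pairs. Plain Python lists stand in for `torch.Tensor` so the arithmetic is visible without dependencies.

```python
def rotate_half(x):
    """Sketch of the split-halves convention (an assumption): (x1, x2) -> (-x2, x1)."""
    if len(x) % 2 != 0:
        raise ValueError("last dimension must have even length")
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    # Negate the second half and move it to the front.
    return [-v for v in x2] + list(x1)


rotated = rotate_half([1.0, 2.0, 3.0, 4.0])
```

Under this convention, applying the function twice negates the input, which is what makes it behave like a 90-degree rotation of each `(x1[i], x2[i])` pair.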

nemo_automodel.components.models.step3p7.vision_encoder.apply_rotary_emb( freqs: torch.Tensor, t: torch.Tensor, start_index: int = 0, scale: float = 1.0, seq_dim: int = -2, ) → torch.Tensor#

Apply 2D rotary embeddings to queries / keys.
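As a rough illustration of what applying rotary embeddings means, the sketch below uses the standard RoPE formula `t * cos(freqs) + rotate_half(t) * sin(freqs)` on a single vector. It is not the module's code: the helper name `apply_rotary` is hypothetical, the split-halves `rotate_half` is an assumption, and the `start_index` and `seq_dim` arguments from the signature are left out.

```python
import math


def rotate_half(x):
    # Assumed split-halves convention: (x1, x2) -> (-x2, x1).
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rotary(angles, t, scale=1.0):
    """Hypothetical helper: rotate vector t by per-channel angles.

    `angles` has the same length as `t`, with each angle repeated once
    in each half so that channel i and channel i + half share an angle.
    """
    rot = rotate_half(t)
    return [
        scale * (v * math.cos(a) + r * math.sin(a))
        for v, r, a in zip(t, rot, angles)
    ]


base_angles = [0.3, 1.1]
angles = base_angles + base_angles  # duplicate so both halves share angles
q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rotary(angles, q)
```

With `scale=1.0` each channel pair is rotated in its own plane, so the vector's norm is unchanged and only its direction encodes position.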

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D( dim: int, max_grid_height: int, max_grid_width: int, use_cls_token: bool = False, theta: Union[int, float] = 10000, max_freq: int = 10, num_freqs: int = 1, theta_rescale_factor: float = 1.0, )#

Bases: torch.nn.Module

Cacheable 2D rotary positional embedding.

Initialization

_compute_inv_freq( base: Union[int, float], dim: int, ) → torch.Tensor#

_compute_freqs(t: torch.Tensor, inv_freq: torch.Tensor)#

_compute_2d_freqs() → torch.Tensor#

forward(q: torch.Tensor, k: torch.Tensor, grid_hw: tuple[int, int])#
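The signatures above suggest that rotation angles are derived from inverse frequencies and a patch grid, but the page does not say how. The sketch below shows one common axial scheme as an assumption: half of the angle channels encode the row index and the other half the column index, with inverse frequencies `1 / theta^(2i/d)`. The names `inv_freq` and `grid_angles` are hypothetical, and `max_freq`, `num_freqs`, and `theta_rescale_factor` are not modelled.

```python
def inv_freq(theta, dim):
    # dim // 2 inverse frequencies: 1 / theta^(2i / dim) for i = 0, 1, ...
    return [1.0 / (theta ** (2 * i / dim)) for i in range(dim // 2)]


def grid_angles(grid_h, grid_w, head_dim, theta=10000.0):
    """Hypothetical helper: one angle vector per patch, in row-major order.

    Each vector has head_dim // 2 entries: the first half depends on the
    row index, the second half on the column index.
    """
    freqs = inv_freq(theta, head_dim // 2)
    table = []
    for row in range(grid_h):
        for col in range(grid_w):
            table.append([row * f for f in freqs] + [col * f for f in freqs])
    return table


# A 2 x 3 patch grid with 8 channels per head gives 6 angle vectors of length 4.
table = grid_angles(grid_h=2, grid_w=3, head_dim=8)
```

Because the table depends only on the grid size, it can be computed once for the largest grid and sliced for smaller ones, which is one plausible reading of "cacheable" together with `max_grid_height` and `max_grid_width`.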

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderLayerScale(dim: int, init_values: float)#

Bases: torch.nn.Module

Per-channel residual scaling used when ls_init_value is set.

Initialization

forward(hidden_states: torch.Tensor) → torch.Tensor#
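Per-channel residual scaling is usually implemented as a learnable vector of length `dim`, initialised to `init_values`, that multiplies each token's channels elementwise. The sketch below assumes that usual pattern and uses plain lists rather than a `torch.nn.Parameter`; `make_layer_scale` is a hypothetical name, not part of the module.

```python
def make_layer_scale(dim, init_values):
    """Hypothetical helper mimicking a layer-scale forward pass."""
    # One scale per channel; in a real module this would be a learnable parameter.
    gamma = [init_values] * dim

    def forward(hidden_states):
        # hidden_states: list of token vectors, each of length `dim`.
        return [[h * g for h, g in zip(token, gamma)] for token in hidden_states]

    return forward


layer_scale = make_layer_scale(dim=3, init_values=0.1)
scaled = layer_scale([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
```

A small `init_values` makes each block's branch contribute little at the start of training, since the scaled output is what gets added back to the residual stream.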

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderMLP(hidden_size: int, intermediate_size: int, hidden_act: str)#

Bases: torch.nn.Module

Feed-forward network used inside each transformer block.

Initialization

forward(hidden_states: torch.Tensor) → torch.Tensor#

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionAttention( hidden_size: int, num_heads: int, max_grid_height: int, max_grid_width: int, use_cls_token: bool = False, use_rope2d: bool = True, rope_theta: Union[int, float] = 10000, rope_max_freq: int = 10, rope_num_freqs: int = 1, rope_theta_rescale_factor: float = 1.0, rope_freqs_for: Literal['lang', 'pixel', 'constant'] = 'lang', )#

Bases: torch.nn.Module

Multi-head self attention with optional 2D RoPE.

Initialization

forward( hidden_states: torch.Tensor, grid_hw: tuple[int, int], ) → torch.Tensor#

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionBlock( hidden_size: int, num_heads: int, mlp_ratio: float, hidden_act: str, layer_norm_eps: float, ls_init_value: Optional[float] = None, max_grid_height: Optional[int] = None, max_grid_width: Optional[int] = None, use_cls_token: bool = False, use_rope2d: bool = True, rope_kwargs: Optional[dict] = None, )#

Bases: torch.nn.Module

A single Vision Transformer block (self-attention + MLP).

Initialization

forward( hidden_states: torch.Tensor, grid_hw: tuple[int, int], ) → torch.Tensor#

class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionTransformer( embed_dim: int, depth: int, num_heads: int, mlp_ratio: float, hidden_act: str, layer_norm_eps: float, ls_init_value: Optional[float] = None, max_grid_height: Optional[int] = None, max_grid_width: Optional[int] = None, use_cls_token: bool = False, use_rope2d: bool = True, rope_kwargs: Optional[dict] = None, )#

Bases: torch.nn.Module

Stack of encoder blocks parameterised by Step35VisionEncoderConfig.

Initialization

forward( hidden_states: torch.Tensor, grid_hw: tuple[int, int], ) → torch.Tensor#

class nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder( config: nemo_automodel.components.models.step3p7.configuration_step3p7.StepRoboticsVisionEncoderConfig, )#

Bases: torch.nn.Module

Vision encoder built from StepRoboticsVisionEncoderConfig.

The encoder performs patch embedding followed by a stack of transformer blocks. Only the config fields defined in StepRoboticsVisionEncoderConfig (and StepRoboticVLConfig.vision_config) are expected.

Initialization

sample_abs_posemb(grid_h: int, grid_w: int)#

forward(pixel_values: torch.Tensor) → torch.Tensor#

Parameters:

pixel_values – Image tensor of shape (B, C, H, W).
layer_idx – Negative indices stop after a given block (e.g., -1 uses all blocks).
strip_cls_token – If True and cls token is used, remove it from output.
