nemo_automodel.components.models.step3p7.vision_encoder
Module Contents
Classes
EncoderRope2D | Cacheable 2D rotary positional embedding.
EncoderLayerScale | Per-channel residual scaling used when ls_init_value is set.
EncoderMLP | Feed-forward network used inside each transformer block.
EncoderVisionAttention | Multi-head self attention with optional 2D RoPE.
EncoderVisionBlock | A single Vision Transformer block (self-attention + MLP).
EncoderVisionTransformer | Stack of encoder blocks parameterised by Step35VisionEncoderConfig.
StepRoboticsVisionEncoder | Vision encoder built from StepRoboticsVisionEncoderConfig.
Functions
rotate_half | Rotate last dimension halves (used by RoPE).
apply_rotary_emb | Apply 2D rotary embeddings to queries / keys.
API
- nemo_automodel.components.models.step3p7.vision_encoder.rotate_half(x: torch.Tensor) -> torch.Tensor
Rotate last dimension halves (used by RoPE).
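The half rotation can be illustrated without any dependencies. The sketch below assumes the common "split into two halves" convention, mapping `[x1, x2]` to `[-x2, x1]`; the module may instead pair adjacent channels, which this page does not specify. `rotate_half_sketch` is a hypothetical stand-in that works on a plain Python list rather than a `torch.Tensor`.

```python
def rotate_half_sketch(x):
    """Map [x1, x2] to [-x2, x1], where x1 and x2 are the two halves of x.

    Assumption: halves convention; the real rotate_half may use another layout.
    """
    if len(x) % 2 != 0:
        raise ValueError("last dimension must have even length")
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    # Negate the second half and move it in front of the first half.
    return [-v for v in x2] + list(x1)


rotated = rotate_half_sketch([1.0, 2.0, 3.0, 4.0])
```

Applying the operation twice negates the input, which is what makes it usable as the "imaginary part" of a rotation in RoPE.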
- nemo_automodel.components.models.step3p7.vision_encoder.apply_rotary_emb(
- freqs: torch.Tensor,
- t: torch.Tensor,
- start_index: int = 0,
- scale: float = 1.0,
- seq_dim: int = -2,
Apply 2D rotary embeddings to queries / keys.
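As a rough illustration of what applying rotary embeddings means, the sketch below rotates a single feature vector using the standard RoPE identity `t * cos(angle) + rotate_half(t) * sin(angle)`. It is an assumption-laden stand-in, not the module's code: `apply_rotary_sketch` and `rotate_half_sketch` are hypothetical names, plain lists replace tensors, the halves pairing convention is assumed, and the `start_index`, `scale`, and `seq_dim` arguments are not modelled.

```python
import math


def rotate_half_sketch(x):
    """Map [x1, x2] to [-x2, x1] (assumed halves convention)."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def apply_rotary_sketch(angles, t):
    """Rotate feature vector t channel-wise: t * cos(a) + rotate_half(t) * sin(a)."""
    rotated = rotate_half_sketch(t)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(t, rotated, angles)]


# Channels are paired as (t[0], t[2]) and (t[1], t[3]); each pair shares one angle,
# so the per-pair angles are repeated across both halves.
pair_angles = [math.pi / 2, 0.0]
angles = pair_angles + pair_angles
out = apply_rotary_sketch(angles, [1.0, 2.0, 0.0, 0.0])
```

Here the first pair `(1.0, 0.0)` is rotated by 90 degrees and the second pair `(2.0, 0.0)` is left unchanged, so the vector norm is preserved.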
- class nemo_automodel.components.models.step3p7.vision_encoder.EncoderRope2D(
- dim: int,
- max_grid_height: int,
- max_grid_width: int,
- use_cls_token: bool = False,
- theta: Union[int, float] = 10000,
- max_freq: int = 10,
- num_freqs: int = 1,
- theta_rescale_factor: float = 1.0,
Bases: torch.nn.Module
Cacheable 2D rotary positional embedding.
Initialization
- _compute_inv_freq(
- base: Union[int, float],
- dim: int,
- _compute_freqs(t: torch.Tensor, inv_freq: torch.Tensor)
- _compute_2d_freqs() -> torch.Tensor
- forward(q: torch.Tensor, k: torch.Tensor, grid_hw: tuple[int, int])
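One plausible reading of a cacheable 2D rotary table is sketched below: angles are computed once for the maximum grid and cropped for smaller grids. Everything here is an assumption made for illustration. The helper names are hypothetical, the feature dimension is assumed to be split evenly between the row and column axes, the standard `base ** (-2i / dim)` inverse-frequency formula is used, and the `max_freq`, `num_freqs`, `theta_rescale_factor`, and `use_cls_token` options are ignored.

```python
def inv_freq_sketch(base, dim):
    """Standard RoPE inverse frequencies: base ** (-2i / dim) for i in range(dim // 2)."""
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]


def grid_angles_sketch(grid_h, grid_w, dim, base=10000.0):
    """Return one angle list per patch in row-major order.

    Assumption: half of the angles encode the row index, half the column index.
    """
    freqs = inv_freq_sketch(base, dim // 2)
    table = []
    for row in range(grid_h):
        for col in range(grid_w):
            table.append([row * f for f in freqs] + [col * f for f in freqs])
    return table


def crop_angles_sketch(table, max_w, grid_h, grid_w):
    """Select the entries of a cached max-grid table that cover a smaller grid."""
    return [table[r * max_w + c] for r in range(grid_h) for c in range(grid_w)]


cached = grid_angles_sketch(4, 4, dim=8)           # built once for the max grid
small = crop_angles_sketch(cached, 4, 2, 3)        # reused for a 2 x 3 image grid
```

Cropping the cached table gives the same angles as recomputing them for the smaller grid, which is why `max_grid_height` and `max_grid_width` are enough to serve every `grid_hw` within those bounds under this reading.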
- class nemo_automodel.components.models.step3p7.vision_encoder.EncoderLayerScale(dim: int, init_values: float)
Bases: torch.nn.Module
Per-channel residual scaling used when ls_init_value is set.
Initialization
- forward(hidden_states: torch.Tensor) -> torch.Tensor
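Per-channel scaling is easy to picture with a small dependency-free sketch. It assumes the usual LayerScale formulation, in which a vector with one entry per channel is initialised to `init_values` and multiplied into the branch output. `LayerScaleSketch` is a hypothetical stand-in that uses nested lists instead of tensors and has no trainable parameters.

```python
class LayerScaleSketch:
    """Multiply every token vector channel-wise by a per-channel scale."""

    def __init__(self, dim, init_values):
        # One scale per channel; in a real module this would be a learnable parameter.
        self.gamma = [init_values] * dim

    def forward(self, hidden_states):
        # hidden_states: list of token vectors, each of length dim.
        return [[v * g for v, g in zip(token, self.gamma)] for token in hidden_states]


scale = LayerScaleSketch(dim=3, init_values=0.5)
scaled = scale.forward([[2.0, 4.0, 6.0], [1.0, 1.0, 1.0]])
```

A small initial value keeps each residual branch close to the identity at the start of training, which is the usual motivation for this layer.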
- class nemo_automodel.components.models.step3p7.vision_encoder.EncoderMLP(hidden_size: int, intermediate_size: int, hidden_act: str)
Bases: torch.nn.Module
Feed-forward network used inside each transformer block.
Initialization
- forward(hidden_states: torch.Tensor) -> torch.Tensor
- class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionAttention(
- hidden_size: int,
- num_heads: int,
- max_grid_height: int,
- max_grid_width: int,
- use_cls_token: bool = False,
- use_rope2d: bool = True,
- rope_theta: Union[int, float] = 10000,
- rope_max_freq: int = 10,
- rope_num_freqs: int = 1,
- rope_theta_rescale_factor: float = 1.0,
- rope_freqs_for: Literal['lang', 'pixel', 'constant'] = 'lang',
Bases: torch.nn.Module
Multi-head self attention with optional 2D RoPE.
Initialization
- forward(
- hidden_states: torch.Tensor,
- grid_hw: tuple[int, int],
- class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionBlock(
- hidden_size: int,
- num_heads: int,
- mlp_ratio: float,
- hidden_act: str,
- layer_norm_eps: float,
- ls_init_value: Optional[float] = None,
- max_grid_height: Optional[int] = None,
- max_grid_width: Optional[int] = None,
- use_cls_token: bool = False,
- use_rope2d: bool = True,
- rope_kwargs: Optional[dict] = None,
Bases: torch.nn.Module
A single Vision Transformer block (self-attention + MLP).
Initialization
- forward(
- hidden_states: torch.Tensor,
- grid_hw: tuple[int, int],
- class nemo_automodel.components.models.step3p7.vision_encoder.EncoderVisionTransformer(
- embed_dim: int,
- depth: int,
- num_heads: int,
- mlp_ratio: float,
- hidden_act: str,
- layer_norm_eps: float,
- ls_init_value: Optional[float] = None,
- max_grid_height: Optional[int] = None,
- max_grid_width: Optional[int] = None,
- use_cls_token: bool = False,
- use_rope2d: bool = True,
- rope_kwargs: Optional[dict] = None,
Bases: torch.nn.Module
Stack of encoder blocks parameterised by Step35VisionEncoderConfig.
Initialization
- forward(
- hidden_states: torch.Tensor,
- grid_hw: tuple[int, int],
- class nemo_automodel.components.models.step3p7.vision_encoder.StepRoboticsVisionEncoder(
- config: nemo_automodel.components.models.step3p7.configuration_step3p7.StepRoboticsVisionEncoderConfig,
Bases: torch.nn.Module
Vision encoder built from StepRoboticsVisionEncoderConfig.
The encoder performs patch embedding followed by a stack of transformer blocks. Only the config fields defined in StepRoboticsVisionEncoderConfig (and StepRoboticVLConfig.vision_config) are expected.
Initialization
- sample_abs_posemb(grid_h: int, grid_w: int)
- forward(pixel_values: torch.Tensor) -> torch.Tensor
- Parameters:
pixel_values – Image tensor of shape (B, C, H, W).
layer_idx – Negative indices stop after a given block (e.g., -1 uses all blocks).
strip_cls_token – If True and cls token is used, remove it from output.
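The bookkeeping implied by these parameters can be sketched in plain Python. The helpers below are hypothetical and rest on stated assumptions: `patch_size` is an illustrative name for the patch edge length, a negative `layer_idx` is resolved Python-style so that `-1` runs every block, and the cls token, when present, is assumed to sit at position 0.

```python
def patch_grid_sketch(height, width, patch_size):
    """Return (grid_h, grid_w) for an image split into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


def blocks_to_run_sketch(depth, layer_idx=-1):
    """Resolve a negative layer_idx into the number of blocks to execute."""
    if not -depth <= layer_idx < 0:
        raise ValueError("layer_idx must be in [-depth, -1]")
    return depth + layer_idx + 1  # -1 -> depth (all blocks), -2 -> depth - 1


def strip_cls_sketch(tokens, use_cls_token, strip_cls_token):
    """Drop the leading cls token only when one exists and stripping is requested."""
    return tokens[1:] if (use_cls_token and strip_cls_token) else tokens


grid_hw = patch_grid_sketch(224, 224, patch_size=14)
num_blocks = blocks_to_run_sketch(depth=24, layer_idx=-2)
tokens = strip_cls_sketch(["cls", "p0", "p1"], use_cls_token=True, strip_cls_token=True)
```

Under these assumptions a 224 x 224 input with 14-pixel patches yields a 16 x 16 grid, and `layer_idx=-2` stops one block before the end of a 24-block stack.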