nemo_automodel.components.models.llava_onevision.rice_vit#
Rice Vision Transformer for LLaVA-OneVision-1.5.
Ported from lmms-lab/LLaVA-OneVision-1.5’s modeling_llavaonevision1_5.py.
Module Contents#
Classes#
RiceAttention | Eager block-diagonal attention over variable-length image segments.
RiceFlashAttention2 | Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn).
RiceSdpaAttention | SDPA variant with an additive block-diagonal mask.
RiceTransformer | Rice ViT with per-image class-token insertion and block-diagonal attention.
Functions#
rotate_half
apply_rotary_pos_emb_vision
Data#
_ATTENTION_CLASSES
API#
- nemo_automodel.components.models.llava_onevision.rice_vit.rotate_half(x: torch.Tensor) → torch.Tensor#
- nemo_automodel.components.models.llava_onevision.rice_vit.apply_rotary_pos_emb_vision(
- q: torch.Tensor,
- k: torch.Tensor,
- cos: torch.Tensor,
- sin: torch.Tensor,
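The two helpers above follow the naming used for rotary position embeddings in Hugging Face vision models. As a rough, dependency-free sketch (plain Python lists rather than `torch.Tensor`, and assuming the common "split the last dimension in halves" convention, which this page does not confirm), the rotation could look like this; `apply_rotary` is a hypothetical stand-in for the per-tensor step, not the module's function:

```python
import math

def rotate_half(x):
    """Map [x1, x2] to [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + list(x1)

def apply_rotary(x, cos, sin):
    """Hypothetical helper: element-wise x * cos + rotate_half(x) * sin."""
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rotated, sin)]

# One angle per pair (x[i], x[i + half]); cos and sin are tiled over both halves.
angles = [0.3, 1.1]
cos = [math.cos(a) for a in angles] * 2
sin = [math.sin(a) for a in angles] * 2

q = [1.0, 2.0, 3.0, 4.0]
q_rot = apply_rotary(q, cos, sin)
```

Under this convention each pair `(x[i], x[i + half])` is rotated by its own angle, so the vector norm is unchanged.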
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceRotaryEmbedding(dim: int, theta: float = 10000.0)#
Bases: torch.nn.Module
Initialization
- forward(seqlen: int) → torch.Tensor#
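For orientation, a rotary embedding with parameters `dim` and `theta` is usually built from inverse frequencies `1 / theta**(i / dim)` for even `i`, multiplied by each position index. The sketch below assumes that standard formulation (the page does not show the implementation) and uses plain Python lists; `rotary_freqs` is a hypothetical name:

```python
def rotary_freqs(seqlen, dim, theta=10000.0):
    """Return a seqlen x (dim // 2) table of rotation angles."""
    # One inverse frequency per pair of channels.
    inv_freq = [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]
    # Angle for position p and frequency f is simply p * f.
    return [[pos * f for f in inv_freq] for pos in range(seqlen)]

freqs = rotary_freqs(seqlen=3, dim=4)
```

The cosine and sine of such a table are what a function like `apply_rotary_pos_emb_vision` would typically receive as `cos` and `sin`.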
- class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchEmbed(
- patch_size: int = 14,
- temporal_patch_size: int = 1,
- in_channels: int = 3,
- embed_dim: int = 1024,
Bases: torch.nn.Module
Initialization
- forward(hidden_states: torch.Tensor) → torch.Tensor#
- class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchMerger(
- dim: int,
- context_dim: int,
- spatial_merge_size: int = 2,
- layer_norm_eps: float = 1e-05,
Bases: torch.nn.Module
Initialization
- forward(x: torch.Tensor) → torch.Tensor#
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceMlp(dim: int, hidden_dim: int, hidden_act: str)#
Bases: torch.nn.Module
Initialization
- forward(x: torch.Tensor) → torch.Tensor#
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceAttention(dim: int, num_heads: int = 16)#
Bases: torch.nn.Module
Eager block-diagonal attention over variable-length image segments.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceFlashAttention2(dim: int, num_heads: int = 16)#
Bases: torch.nn.Module
Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn).
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceSdpaAttention(dim: int, num_heads: int = 16)#
Bases: torch.nn.Module
SDPA variant with an additive block-diagonal mask.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- position_embeddings: Tuple[torch.Tensor, torch.Tensor],
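All three attention variants take `cu_seqlens`, the cumulative segment boundaries of the packed sequence. As an illustration of what an additive block-diagonal mask means, here is a minimal sketch on nested Python lists; the helper name `block_diagonal_mask` is hypothetical, and the real module presumably builds the equivalent `torch.Tensor`:

```python
def block_diagonal_mask(cu_seqlens):
    """Additive mask: 0.0 inside each segment, -inf across segments."""
    total = cu_seqlens[-1]
    neg_inf = float("-inf")
    mask = [[neg_inf] * total for _ in range(total)]
    # Consecutive boundaries (start, end) delimit one image segment.
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 0.0
    return mask

# Two packed segments of lengths 2 and 3.
mask = block_diagonal_mask([0, 2, 5])
```

Adding this mask to the attention scores before the softmax leaves tokens free to attend within their own segment and drives cross-segment weights to zero.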
- nemo_automodel.components.models.llava_onevision.rice_vit._ATTENTION_CLASSES#
None
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceBlock(config, attn_implementation: str = 'eager')#
Bases: torch.nn.Module
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- position_embeddings: Tuple[torch.Tensor, torch.Tensor],
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceTransformer(config, attn_implementation: str = 'eager')#
Bases: torch.nn.Module
Rice ViT with per-image class-token insertion and block-diagonal attention.
Matches the HF reference: one CLS token is prepended at the start of each image segment inside the flat packed sequence, and the attention mask is built from a cu_seqlens that accounts for the extra CLS per segment.
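To make the CLS bookkeeping concrete, the sketch below computes a `cu_seqlens` list in which every segment is one token longer than its patch count. It assumes that each row of `grid_thw` is `(t, h, w)` in patch units and that each temporal frame forms its own segment (so a still image has `t = 1`); neither detail is stated on this page, and `cu_seqlens_with_cls` is a hypothetical name:

```python
from itertools import accumulate

def cu_seqlens_with_cls(grid_thw):
    """Cumulative segment boundaries with one extra CLS token per segment."""
    seg_lens = []
    for t, h, w in grid_thw:
        # h * w patch tokens plus one prepended CLS token, repeated per frame.
        seg_lens.extend([h * w + 1] * t)
    return [0] + list(accumulate(seg_lens))

# Two still images with 2x2 and 3x4 patch grids.
cu = cu_seqlens_with_cls([(1, 2, 2), (1, 3, 4)])
```

With these boundaries, the CLS token at the start of each segment attends only to the patches of its own image.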
Initialization
- property dtype: torch.dtype#
- rot_pos_emb(grid_thw: torch.Tensor) → torch.Tensor#
- forward(
- pixel_values: torch.Tensor,
- grid_thw: torch.Tensor,