nemo_automodel.components.models.llava_onevision.rice_vit#
Rice Vision Transformer for LLaVA-OneVision-1.5.
Module Contents#
Classes#
| Class | Description |
|---|---|
| RiceRotaryEmbedding | 2D Rotary Position Embedding for Rice ViT. |
| RicePatchEmbed | Patch embedding layer for Rice ViT. |
| RicePatchMerger | Merges spatial patches and projects to text hidden size. |
| RiceMlp | MLP block for Rice ViT. |
| RiceAttention | Multi-head attention with block-diagonal mask for variable-length images. |
| RiceBlock | Transformer block for Rice ViT. |
Functions#
| Function | Description |
|---|---|
| rotate_half | Rotates half the hidden dims of the input. |
| apply_rotary_pos_emb_vision | Apply rotary positional embeddings to vision attention. |
API#
- nemo_automodel.components.models.llava_onevision.rice_vit.rotate_half(x: torch.Tensor) → torch.Tensor#
Rotates half the hidden dims of the input.
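The body of `rotate_half` is not reproduced on this page. It is conventionally the RoPE helper that splits the last dimension in half and swaps the halves with a sign flip; a minimal sketch of that common pattern, not necessarily the file's exact code:

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension into two halves and recombine as (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)
```

For example, `rotate_half(torch.tensor([1., 2., 3., 4.]))` yields `[-3., -4., 1., 2.]`.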
- nemo_automodel.components.models.llava_onevision.rice_vit.apply_rotary_pos_emb_vision(
- q: torch.Tensor,
- k: torch.Tensor,
- cos: torch.Tensor,
- sin: torch.Tensor,
)#
Apply rotary positional embeddings to vision attention.
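A minimal sketch of how such a function is typically implemented, assuming `cos`/`sin` broadcast against the shapes of `q` and `k` (the real function's shape handling and dtype casting may differ):

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb_vision(q, k, cos, sin):
    # Standard RoPE application: elementwise rotation of q and k
    # by the per-position cos/sin tables.
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```

With `cos = 1` and `sin = 0` (position zero), the rotation is the identity, which is a quick sanity check.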
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceRotaryEmbedding(dim: int, theta: float = 10000.0)#
Bases: torch.nn.Module

2D Rotary Position Embedding for Rice ViT.
Initialization
- forward(seqlen: int) → torch.Tensor#
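`forward(seqlen)` presumably returns per-position rotary frequencies for one spatial axis, which the caller combines for height and width. A sketch of the common vision-RoPE pattern (the buffer layout and class internals here are assumptions, not this module's actual code):

```python
import torch


class RiceRotaryEmbeddingSketch(torch.nn.Module):
    def __init__(self, dim: int, theta: float = 10000.0):
        super().__init__()
        # Inverse frequencies for every second channel, as in standard RoPE.
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    def forward(self, seqlen: int) -> torch.Tensor:
        # Outer product of positions and inverse frequencies:
        # shape (seqlen, dim // 2).
        seq = torch.arange(seqlen, dtype=self.inv_freq.dtype)
        return torch.outer(seq, self.inv_freq)
```

The 2D aspect usually comes from evaluating this per-axis table at each patch's row and column index and concatenating the results.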
- class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchEmbed(
- patch_size: int = 14,
- temporal_patch_size: int = 1,
- in_channels: int = 3,
- embed_dim: int = 1152,
)#
Bases: torch.nn.Module

Patch embedding layer for Rice ViT.
Initialization
- forward(hidden_states: torch.Tensor) → torch.Tensor#
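Given the constructor parameters above (`patch_size`, `temporal_patch_size`, `in_channels`, `embed_dim`), the usual implementation is a strided 3D convolution over flattened patch pixels. A sketch under that assumption; the actual layer may differ in bias usage or dtype handling:

```python
import torch


class RicePatchEmbedSketch(torch.nn.Module):
    def __init__(self, patch_size: int = 14, temporal_patch_size: int = 1,
                 in_channels: int = 3, embed_dim: int = 1152):
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch_size = temporal_patch_size
        self.in_channels = in_channels
        self.embed_dim = embed_dim
        kernel = (temporal_patch_size, patch_size, patch_size)
        # Kernel == stride, so each patch maps to exactly one output token.
        self.proj = torch.nn.Conv3d(in_channels, embed_dim,
                                    kernel_size=kernel, stride=kernel, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Input: flattened patch pixels, shape (num_patches, C * T * P * P).
        x = hidden_states.view(-1, self.in_channels, self.temporal_patch_size,
                               self.patch_size, self.patch_size)
        return self.proj(x).view(-1, self.embed_dim)
```

With the defaults, a batch of five flattened patches of size `3 * 1 * 14 * 14` comes out as a `(5, 1152)` tensor of patch embeddings.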
- class nemo_automodel.components.models.llava_onevision.rice_vit.RicePatchMerger(
- dim: int,
- context_dim: int,
- spatial_merge_size: int = 2,
- layer_norm_eps: float = 1e-05,
)#
Bases: torch.nn.Module

Merges spatial patches and projects to text hidden size.
Initialization
- forward(x: torch.Tensor) → torch.Tensor#
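A merger of this shape typically normalizes the vision tokens, groups `spatial_merge_size ** 2` neighboring patches into one row, and projects through a small MLP to the text hidden size `dim`. A hedged sketch of that pattern (layer names are illustrative, not the module's actual attributes):

```python
import torch


class RicePatchMergerSketch(torch.nn.Module):
    def __init__(self, dim: int, context_dim: int,
                 spatial_merge_size: int = 2, layer_norm_eps: float = 1e-05):
        super().__init__()
        # Each output token aggregates spatial_merge_size**2 input patches.
        self.hidden_size = context_dim * (spatial_merge_size ** 2)
        self.ln_q = torch.nn.LayerNorm(context_dim, eps=layer_norm_eps)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(self.hidden_size, self.hidden_size),
            torch.nn.GELU(),
            torch.nn.Linear(self.hidden_size, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (num_patches, context_dim) -> (num_patches / merge**2, dim)
        return self.mlp(self.ln_q(x).view(-1, self.hidden_size))
```

With `spatial_merge_size=2`, 16 patches of width `context_dim` collapse into 4 tokens of width `dim`.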
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceMlp(dim: int, hidden_dim: int, hidden_act: str)#
Bases: torch.nn.Module

MLP block for Rice ViT.
Initialization
- forward(x: torch.Tensor) → torch.Tensor#
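Given the `(dim, hidden_dim, hidden_act)` constructor, this is presumably the standard two-layer feed-forward block with a configurable activation; a sketch (the activation lookup here is an assumption, the real module may resolve `hidden_act` differently):

```python
import torch

# Hypothetical activation registry for the string-valued hidden_act argument.
ACT2FN = {"gelu": torch.nn.GELU, "silu": torch.nn.SiLU, "relu": torch.nn.ReLU}


class RiceMlpSketch(torch.nn.Module):
    def __init__(self, dim: int, hidden_dim: int, hidden_act: str = "gelu"):
        super().__init__()
        self.fc1 = torch.nn.Linear(dim, hidden_dim)
        self.act = ACT2FN[hidden_act]()
        self.fc2 = torch.nn.Linear(hidden_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Expand to hidden_dim, apply nonlinearity, project back to dim.
        return self.fc2(self.act(self.fc1(x)))
```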
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceAttention(dim: int, num_heads: int = 16)#
Bases: torch.nn.Module

Multi-head attention with block-diagonal mask for variable-length images.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- rotary_pos_emb: Optional[torch.Tensor] = None,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
)#
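The `cu_seqlens` argument is the cumulative-sequence-lengths tensor familiar from variable-length attention kernels: boundaries between images packed into one sequence. The block-diagonal mask mentioned in the class docstring can be built from it like this (a sketch of the idea, not the class's actual masking code, which may use a fused kernel instead of a dense mask):

```python
import torch


def block_diagonal_mask(cu_seqlens: torch.Tensor, seq_len: int) -> torch.Tensor:
    """True where query i may attend to key j, i.e. both belong to the same image."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in zip(cu_seqlens[:-1].tolist(), cu_seqlens[1:].tolist()):
        mask[start:end, start:end] = True
    return mask
```

For `cu_seqlens = [0, 2, 5]` (a 2-patch image followed by a 3-patch image), patches 0 and 1 attend to each other but not to patches 2 through 4, and vice versa.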
- class nemo_automodel.components.models.llava_onevision.rice_vit.RiceBlock(config)#
Bases: torch.nn.Module

Transformer block for Rice ViT.
Initialization
- forward(
- hidden_states: torch.Tensor,
- cu_seqlens: torch.Tensor,
- rotary_pos_emb: Optional[torch.Tensor] = None,
- position_embeddings: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
)#
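The block's forward pass is presumably the standard pre-norm residual pattern: attention over the normalized input, then the MLP, each added back to the residual stream. A self-contained sketch that substitutes `torch.nn.MultiheadAttention` for `RiceAttention` (so it omits `cu_seqlens` and rotary embeddings) just to show the wiring:

```python
import torch


class RiceBlockSketch(torch.nn.Module):
    def __init__(self, dim: int, num_heads: int, mlp_dim: int):
        super().__init__()
        self.norm1 = torch.nn.LayerNorm(dim)
        # Stand-in for RiceAttention; the real block would thread
        # cu_seqlens and rotary position embeddings through here.
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = torch.nn.LayerNorm(dim)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, mlp_dim),
            torch.nn.GELU(),
            torch.nn.Linear(mlp_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-norm residual attention, then pre-norm residual MLP.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```

The residual structure means the block preserves the token shape `(batch, seq, dim)` end to end.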