nemo_automodel.components.models.llava_onevision.rice_vit
Rice Vision Transformer for LLaVA-OneVision-1.5.
Ported from lmms-lab/LLaVA-OneVision-1.5’s modeling_llavaonevision1_5.py.
Module Contents
Classes
Functions
Data
API
Bases: Module
Eager block-diagonal attention over variable-length image segments.
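As an illustration of what block-diagonal attention over variable-length segments means, here is a minimal pure-Python sketch. It is not this module's code: the function name `segment_attention` and the list-of-lists inputs are hypothetical, and it assumes segment boundaries are given as cumulative offsets (`cu_seqlens`). Each token attends only to tokens in its own segment, which is equivalent to masking out every off-diagonal block.

```python
import math

def segment_attention(q, k, v, cu_seqlens):
    """Hypothetical helper: softmax attention computed independently inside
    each segment [cu_seqlens[i], cu_seqlens[i + 1]) of a packed sequence.

    q, k, v are lists of per-token vectors (lists of floats).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    dim = len(v[0])
    out = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            # Scores only against keys from the same segment.
            scores = [
                scale * sum(a * b for a, b in zip(q[i], k[j]))
                for j in range(start, end)
            ]
            top = max(scores)  # subtract the max for numerical stability
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            out.append([
                sum(w * v[start + j][d] for j, w in enumerate(weights))
                for d in range(dim)
            ])
    return out

# Two packed "images": tokens 0-1 and tokens 2-4.
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
v = [[1.0], [1.0], [5.0], [5.0], [5.0]]
out = segment_attention(q, k, v, [0, 2, 5])
```

Because the values are constant within each segment, every output row in the first segment stays near 1.0 and every row in the second stays near 5.0, showing that nothing leaks across segment boundaries.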
Bases: Module
Bases: Module
Flash-attention-2 variant using flash_attn_varlen_func (requires flash_attn).
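`flash_attn_varlen_func` takes the packed sequence together with cumulative sequence offsets (`cu_seqlens_q`, `cu_seqlens_k`) and maximum segment lengths (`max_seqlen_q`, `max_seqlen_k`). The sketch below shows only how such metadata could be derived from per-image token counts in plain Python. The helper name `varlen_metadata` is hypothetical, and a real call would need the offsets as an int32 tensor on the same device as the inputs.

```python
from itertools import accumulate

def varlen_metadata(seq_lens):
    """Hypothetical helper: derive (cu_seqlens, max_seqlen) from the
    number of tokens in each packed segment."""
    cu_seqlens = [0, *accumulate(seq_lens)]  # running offsets, length n + 1
    max_seqlen = max(seq_lens)               # longest single segment
    return cu_seqlens, max_seqlen

# Three images contributing 4, 9 and 2 tokens to the packed sequence.
cu_seqlens, max_seqlen = varlen_metadata([4, 9, 2])
```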
Bases: Module
Bases: Module
Bases: Module
Bases: Module
Bases: Module
SDPA variant with an additive block-diagonal mask.
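An additive block-diagonal mask holds 0 where attention is allowed and negative infinity elsewhere, so adding it to the scores before the softmax drives cross-segment weights to zero. The following stdlib-only sketch illustrates the idea. The helper names are hypothetical, it assumes every segment is non-empty, and the actual module presumably builds a tensor to pass as the mask argument of the SDPA call instead.

```python
import math

def additive_block_mask(cu_seqlens):
    """Hypothetical helper: 0.0 inside each segment's block, -inf outside."""
    total = cu_seqlens[-1]
    mask = [[-math.inf] * total for _ in range(total)]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = 0.0
    return mask

def masked_softmax(scores, mask_row):
    """Softmax over scores after adding one row of the additive mask."""
    shifted = [s + m for s, m in zip(scores, mask_row)]
    top = max(shifted)  # finite, because the diagonal is always unmasked
    exps = [math.exp(x - top) for x in shifted]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = additive_block_mask([0, 2, 5])
weights = masked_softmax([0.0] * 5, mask[0])
```

With uniform scores, token 0 splits its attention evenly over the two tokens of its own segment and assigns exactly zero weight to the other three.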
Bases: Module
Rice ViT with per-image class-token insertion and block-diagonal attention.
Matches the HF reference: one CLS token is prepended to each image segment inside the flat packed sequence, and the attention mask is built from cu_seqlens offsets that account for the extra CLS token per segment.
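The bookkeeping for per-segment CLS insertion can be sketched in a few lines. This is an illustrative stand-in that uses Python lists rather than tensors, and the function name `insert_cls` is hypothetical. The point it shows is that after one CLS token is prepended to every segment, the i-th offset grows by i.

```python
def insert_cls(tokens, cu_seqlens, cls_token):
    """Hypothetical helper: prepend cls_token to every segment of a flat
    packed sequence and return the new sequence with updated offsets."""
    packed = []
    new_cu_seqlens = [0]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        packed.append(cls_token)           # CLS opens the segment
        packed.extend(tokens[start:end])   # followed by the segment's tokens
        new_cu_seqlens.append(len(packed))
    return packed, new_cu_seqlens

# Two images with 2 and 3 patch tokens.
packed, new_cu = insert_cls(["a", "b", "c", "d", "e"], [0, 2, 5], "CLS")
```

Here the original offsets `[0, 2, 5]` become `[0, 3, 7]`, and any block-diagonal mask or varlen call has to use the updated offsets so that each CLS token attends within its own image only.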