nemo_automodel.components.models.llava_onevision.model#

LLaVA-OneVision-1.5 model implementation.

Matches the layout of lmms-lab/LLaVA-OneVision-1.5-*-{Base,Instruct} so that HF safetensors load into this module tree via LlavaOneVisionStateDictAdapter with only regex-renames (no tensor transforms).

In-memory tree:

model.visual.* (RiceTransformer)

model.language_model.* (transformers.Qwen3Model; LLaVA-OV-1.5’s text backbone is Qwen3)

lm_head.* (nn.Linear)
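A minimal sketch of what "regex-renames only" can look like for a state dict. The source-side key prefixes and the `RENAME_RULES` table below are hypothetical, chosen only to illustrate the idea; the actual patterns are defined by LlavaOneVisionStateDictAdapter and may differ. Only the target prefixes (`model.visual.`, `model.language_model.`, `lm_head.`) come from the tree above.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rename rules for illustration; not the adapter's real patterns.
RENAME_RULES: List[Tuple[str, str]] = [
    (r"^visual\.", "model.visual."),
    (r"^language_model\.model\.", "model.language_model."),
    (r"^language_model\.lm_head\.", "lm_head."),
]


def rename_keys(state_dict: Dict[str, object]) -> Dict[str, object]:
    """Rename keys with the first matching rule; values pass through untouched."""
    renamed: Dict[str, object] = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in RENAME_RULES:
            if re.match(pattern, key):
                new_key = re.sub(pattern, replacement, key, count=1)
                break
        renamed[new_key] = value
    return renamed
```

Because values are never touched, a checkpoint converted this way holds the same tensors under different names, which is what "no tensor transforms" implies.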

Module Contents#

Classes#

RiceConfig

Configuration for the Rice ViT vision tower.

Llavaonevision1_5Config

Top-level config for LLaVA-OneVision-1.5.

LLaVAOneVision1_5_Model

Combined vision + language backbone. Returns last_hidden_state.

LLaVAOneVision1_5_ForConditionalGeneration

LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text).

Functions#

_build_text_config

Coerce a text_config dict from HF (or user) into a Qwen3Config.

_coerce_text_config

Accept a raw HF remote-code text config and return a Qwen3Config.

_coerce_vision_config

Data#

LOGGER

API#

nemo_automodel.components.models.llava_onevision.model.LOGGER#

‘getLogger(…)’

class nemo_automodel.components.models.llava_onevision.model.RiceConfig(
depth: int = 24,
embed_dim: int = 1024,
hidden_size: int = 1024,
hidden_act: str = 'gelu',
intermediate_size: int = 4096,
num_heads: int = 16,
in_channels: int = 3,
patch_size: int = 14,
spatial_merge_size: int = 2,
temporal_patch_size: int = 1,
initializer_range: float = 0.02,
layer_norm_eps: float = 1e-05,
text_hidden_size: int = 2560,
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Configuration for the Rice ViT vision tower.

Initialization

model_type#

‘rice_vit’

base_config_key#

‘vision_config’
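To make the geometry parameters concrete, the sketch below computes how many vision tokens an image would produce from `patch_size` and `spatial_merge_size`. It assumes the Rice ViT follows the Qwen2-VL-style convention in which a `(t, h, w)` grid is counted in patches and each `spatial_merge_size × spatial_merge_size` block of patches is merged into one token; this page does not confirm that convention. The helper names `patch_grid` and `merged_token_count` are hypothetical.

```python
from typing import Tuple


def patch_grid(height_px: int, width_px: int, patch_size: int = 14) -> Tuple[int, int, int]:
    """Return a (t, h, w) grid in patch units for a single image (t = 1)."""
    if height_px % patch_size or width_px % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return 1, height_px // patch_size, width_px // patch_size


def merged_token_count(t: int, h: int, w: int, spatial_merge_size: int = 2) -> int:
    """Number of tokens after merging spatial_merge_size**2 patches into one.

    Assumption: merging follows the Qwen2-VL-style convention.
    """
    if h % spatial_merge_size or w % spatial_merge_size:
        raise ValueError("grid sides must be multiples of spatial_merge_size")
    return t * h * w // (spatial_merge_size ** 2)
```

Under these assumptions and the default values, a 448×448 image gives a 32×32 patch grid and 256 merged tokens.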

class nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config(
text_config: Optional[Union[Dict, transformers.configuration_utils.PretrainedConfig]] = None,
vision_config: Optional[Union[Dict, nemo_automodel.components.models.llava_onevision.model.RiceConfig]] = None,
image_token_id: int = 151655,
video_token_id: int = 151656,
vision_start_token_id: int = 151652,
vision_end_token_id: int = 151653,
vocab_size: int = 152064,
architectures: Optional[List[str]] = None,
**kwargs,
)#

Bases: transformers.configuration_utils.PretrainedConfig

Top-level config for LLaVA-OneVision-1.5.

model_type matches the on-hub value exactly so AutoConfig.from_pretrained resolves to this class without trust_remote_code once registered.

Initialization

model_type#

‘llavaonevision1_5’

sub_configs#

None

to_dict() → Dict[str, Any]#

nemo_automodel.components.models.llava_onevision.model._build_text_config(
data: Dict[str, Any],
) → transformers.configuration_utils.PretrainedConfig#

Coerce a text_config dict from HF (or user) into a Qwen3Config.

LLaVA-OV-1.5’s text backbone is Qwen3 (q/k norm, GQA, standard SiLU MLP). On-hub model_type is LLaVAOneVision1_5_text; we drop it so Qwen3Config doesn’t reject the kwargs.
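The "drop `model_type`" step described above can be sketched on a plain dict. The helper name `strip_model_type` is hypothetical and this is not the function's actual body; it only shows the filtering step, leaving the construction of the Qwen3Config to the caller.

```python
from typing import Any, Dict


def strip_model_type(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a text_config dict without its 'model_type' entry.

    The input is not mutated, so the caller's on-hub config dict stays intact.
    """
    cleaned = dict(data)
    cleaned.pop("model_type", None)  # no error if the key is already absent
    return cleaned
```

The cleaned dict would then be passed on as keyword arguments when building the Qwen3Config.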

nemo_automodel.components.models.llava_onevision.model._coerce_text_config(
tc: Any,
) → transformers.configuration_utils.PretrainedConfig#

Accept a raw HF remote-code text config and return a Qwen3Config.

The constructor path for NeMo custom models is cls(hf_config) where hf_config may be the remote-code Llavaonevision1_5Config whose text_config is a LLaVAOneVision1_5_TextConfig instance. Normalize to Qwen3Config so the inner Qwen3Model gets fields it understands.

nemo_automodel.components.models.llava_onevision.model._coerce_vision_config(
vc: Any,
) → nemo_automodel.components.models.llava_onevision.model.RiceConfig#

class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model(
config: nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config,
attn_implementation: str = 'eager',
)#

Bases: torch.nn.Module

Combined vision + language backbone. Returns last_hidden_state.

Initialization

get_input_embeddings()#
set_input_embeddings(value)#
get_image_features(
pixel_values: torch.FloatTensor,
image_grid_thw: torch.LongTensor,
) → torch.Tensor#
forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
pixel_values: Optional[torch.FloatTensor] = None,
pixel_values_videos: Optional[torch.FloatTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
**kwargs,
)#
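
A pre-flight check that callers may find useful before invoking forward with image inputs. It assumes, as in Qwen2-VL-style models, that each image feature vector fills one `image_token_id` placeholder in `input_ids`; this page does not describe the merge step, so treat that as an assumption. The helper names are hypothetical, and 151655 is the default `image_token_id` from Llavaonevision1_5Config.

```python
from typing import Sequence

IMAGE_TOKEN_ID = 151655  # default image_token_id from Llavaonevision1_5Config


def count_placeholders(input_ids: Sequence[int], image_token_id: int = IMAGE_TOKEN_ID) -> int:
    """Count image placeholder tokens in one tokenized sequence."""
    return sum(1 for token in input_ids if token == image_token_id)


def check_image_alignment(
    input_ids: Sequence[int],
    num_image_features: int,
    image_token_id: int = IMAGE_TOKEN_ID,
) -> None:
    """Raise ValueError if placeholders and image features do not line up.

    Assumption: one feature vector per placeholder token.
    """
    found = count_placeholders(input_ids, image_token_id)
    if found != num_image_features:
        raise ValueError(
            f"{found} image placeholder tokens but {num_image_features} image features"
        )
```
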
class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration(
config,
attn_implementation: Optional[str] = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module

LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text).

Initialization

config_class#

None

classmethod from_config(config, **kwargs)#
classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#
property dtype#
property visual#
property language_model#
get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
pixel_values: Optional[torch.FloatTensor] = None,
pixel_values_videos: Optional[torch.FloatTensor] = None,
image_grid_thw: Optional[torch.LongTensor] = None,
video_grid_thw: Optional[torch.LongTensor] = None,
**kwargs,
) → Union[Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]#