nemo_automodel.components.models.llava_onevision.model#
LLaVA-OneVision-1.5 model implementation.
Matches the layout of lmms-lab/LLaVA-OneVision-1.5-*-{Base,Instruct} so that HF safetensors load into this module tree via LlavaOneVisionStateDictAdapter with only regex-renames (no tensor transforms).
In-memory tree:
- model.visual.* (RiceTransformer)
- model.language_model.* (transformers.Qwen3Model; LLaVA-OV-1.5's text backbone is Qwen3)
- lm_head.* (nn.Linear)
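As a rough illustration of what "regex-renames only" means, the sketch below maps checkpoint keys onto the in-memory tree with the standard `re` module. The source key names and the three rules are hypothetical; the actual rules live in LlavaOneVisionStateDictAdapter and are not reproduced here.

```python
import re

# Hypothetical (pattern, replacement) rules. The real ones are defined by
# LlavaOneVisionStateDictAdapter and may differ.
RENAME_RULES = [
    (re.compile(r"^visual\."), "model.visual."),
    (re.compile(r"^language_model\.model\."), "model.language_model."),
    (re.compile(r"^language_model\.lm_head\."), "lm_head."),
]


def rename_key(key: str) -> str:
    """Apply the first matching rule; leave unmatched keys unchanged."""
    for pattern, replacement in RENAME_RULES:
        new_key, count = pattern.subn(replacement, key, count=1)
        if count:
            return new_key
    return key


def rename_state_dict(state_dict: dict) -> dict:
    """Rename keys only; values (tensors) pass through untouched."""
    return {rename_key(k): v for k, v in state_dict.items()}


# Placeholder strings stand in for tensors; the key names are made up.
checkpoint = {
    "visual.blocks.0.attn.qkv.weight": "tensor-a",
    "language_model.model.layers.0.mlp.up_proj.weight": "tensor-b",
    "language_model.lm_head.weight": "tensor-c",
}
renamed = rename_state_dict(checkpoint)
```

Because only keys change, the same mapping can in principle be inverted when saving back to the on-hub layout.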
Module Contents#
Classes#
RiceConfig | Configuration for the Rice ViT vision tower.
Llavaonevision1_5Config | Top-level config for LLaVA-OneVision-1.5.
LLaVAOneVision1_5_Model | Combined vision + language backbone. Returns last_hidden_state.
LLaVAOneVision1_5_ForConditionalGeneration | LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text).
Functions#
_build_text_config | Coerce a text_config dict from HF (or user) into a Qwen3Config.
_coerce_text_config | Accept a raw HF remote-code text config and return a Qwen3Config.
_coerce_vision_config
Data#
LOGGER
API#
- nemo_automodel.components.models.llava_onevision.model.LOGGER#
‘getLogger(…)’
- class nemo_automodel.components.models.llava_onevision.model.RiceConfig(
- depth: int = 24,
- embed_dim: int = 1024,
- hidden_size: int = 1024,
- hidden_act: str = 'gelu',
- intermediate_size: int = 4096,
- num_heads: int = 16,
- in_channels: int = 3,
- patch_size: int = 14,
- spatial_merge_size: int = 2,
- temporal_patch_size: int = 1,
- initializer_range: float = 0.02,
- layer_norm_eps: float = 1e-05,
- text_hidden_size: int = 2560,
- **kwargs,
)
Bases: transformers.configuration_utils.PretrainedConfig
Configuration for the Rice ViT vision tower.
Initialization
- model_type#
‘rice_vit’
- base_config_key#
‘vision_config’
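The defaults above imply some simple shape arithmetic. The sketch below works it through, assuming the Qwen2-VL-style convention that spatial_merge_size groups each s x s block of neighboring patches into one token passed to the language model. That convention is an assumption here, not something this page states, and `vision_token_counts` is a hypothetical helper.

```python
# Defaults copied from the RiceConfig signature above.
patch_size = 14
spatial_merge_size = 2
embed_dim = 1024
num_heads = 16


def vision_token_counts(height: int, width: int) -> tuple:
    """Return (patches, merged_tokens) for a single image frame.

    Assumes height and width are multiples of patch_size * spatial_merge_size.
    """
    unit = patch_size * spatial_merge_size
    if height % unit or width % unit:
        raise ValueError(f"height and width must be multiples of {unit}")
    grid_h = height // patch_size
    grid_w = width // patch_size
    patches = grid_h * grid_w
    # Assumed: each spatial_merge_size x spatial_merge_size block becomes one token.
    merged = patches // (spatial_merge_size ** 2)
    return patches, merged


head_dim = embed_dim // num_heads                # 1024 // 16 = 64
patches, merged = vision_token_counts(448, 448)  # 32 x 32 patch grid
```

Under these assumptions a 448 x 448 image yields 1024 patches and 256 merged tokens.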
- class nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config(
- text_config: Optional[Union[Dict, transformers.configuration_utils.PretrainedConfig]] = None,
- vision_config: Optional[Union[Dict, nemo_automodel.components.models.llava_onevision.model.RiceConfig]] = None,
- image_token_id: int = 151655,
- video_token_id: int = 151656,
- vision_start_token_id: int = 151652,
- vision_end_token_id: int = 151653,
- vocab_size: int = 152064,
- architectures: Optional[List[str]] = None,
- **kwargs,
)
Bases: transformers.configuration_utils.PretrainedConfig
Top-level config for LLaVA-OneVision-1.5.
model_type matches the on-hub value exactly so AutoConfig.from_pretrained resolves to this class without trust_remote_code once registered.
Initialization
- model_type#
‘llavaonevision1_5’
- sub_configs#
None
- to_dict() → Dict[str, Any]#
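To make the special-token defaults concrete, here is a minimal sketch that locates image placeholder tokens in a token sequence. The prompt layout (placeholders wrapped in vision start/end markers) and the helper name are illustrative assumptions; only the ID values come from the signature above.

```python
# Defaults copied from the Llavaonevision1_5Config signature above.
IMAGE_TOKEN_ID = 151655
VISION_START_TOKEN_ID = 151652
VISION_END_TOKEN_ID = 151653


def image_placeholder_positions(input_ids: list) -> list:
    """Return the indices of all image placeholder tokens (hypothetical helper)."""
    return [i for i, tok in enumerate(input_ids) if tok == IMAGE_TOKEN_ID]


# Made-up sequence: two text tokens, one image with four placeholders, one text token.
input_ids = [
    11, 12,
    VISION_START_TOKEN_ID,
    IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID,
    VISION_END_TOKEN_ID,
    13,
]
positions = image_placeholder_positions(input_ids)
```

A check like `len(positions)` against the number of vision tokens produced for the image is a cheap way to catch a mismatch between the processor output and the vision tower.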
- nemo_automodel.components.models.llava_onevision.model._build_text_config(
- data: Dict[str, Any],
)
Coerce a text_config dict from HF (or user) into a Qwen3Config.
LLaVA-OV-1.5's text backbone is Qwen3 (q/k norm, GQA, standard SiLU MLP). The on-hub model_type is LLaVAOneVision1_5_text; we drop it so Qwen3Config doesn't reject the kwargs.
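The key-dropping step described above can be sketched without importing transformers. `strip_model_type` is a hypothetical helper, not the module's implementation, and the field values are illustrative rather than read from a checkpoint.

```python
from typing import Any, Dict


def strip_model_type(data: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of data without the 'model_type' key (hypothetical helper)."""
    cleaned = dict(data)             # shallow copy so the caller's dict is untouched
    cleaned.pop("model_type", None)  # no error if the key is already absent
    return cleaned


hub_text_config = {
    "model_type": "LLaVAOneVision1_5_text",
    "hidden_size": 2560,        # illustrative value
    "num_hidden_layers": 4,     # illustrative value
}
qwen3_kwargs = strip_model_type(hub_text_config)
```

The remaining keys would then be passed on as keyword arguments when constructing the Qwen3Config.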
- nemo_automodel.components.models.llava_onevision.model._coerce_text_config(
- tc: Any,
)
Accept a raw HF remote-code text config and return a Qwen3Config.
The constructor path for NeMo custom models is cls(hf_config), where hf_config may be the remote-code Llavaonevision1_5Config whose text_config is a LLaVAOneVision1_5_TextConfig instance. Normalize to Qwen3Config so the inner Qwen3Model gets fields it understands.
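One way to picture the normalization is as a type dispatch that reduces whatever arrives to a plain dict before a Qwen3Config is built from it. The sketch below is an assumption about the general shape, not the module's code: `to_plain_dict` and `FakeRemoteTextConfig` are hypothetical, and the only real API relied on is that PretrainedConfig-style objects expose to_dict().

```python
from typing import Any, Dict


class FakeRemoteTextConfig:
    """Stand-in for a remote-code config object that exposes to_dict()."""

    def __init__(self, **fields: Any) -> None:
        self._fields = dict(fields)

    def to_dict(self) -> Dict[str, Any]:
        return dict(self._fields)


def to_plain_dict(tc: Any) -> Dict[str, Any]:
    """Reduce None, a dict, or a config-like object to a plain dict."""
    if tc is None:
        return {}
    if isinstance(tc, dict):
        return dict(tc)
    if hasattr(tc, "to_dict"):
        return tc.to_dict()
    raise TypeError(f"unsupported text_config type: {type(tc).__name__}")


remote = FakeRemoteTextConfig(model_type="LLaVAOneVision1_5_text", hidden_size=2560)
from_object = to_plain_dict(remote)
from_dict = to_plain_dict({"hidden_size": 2560})
from_none = to_plain_dict(None)
```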
- nemo_automodel.components.models.llava_onevision.model._coerce_vision_config(
- vc: Any,
)
- class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_Model(
- config: nemo_automodel.components.models.llava_onevision.model.Llavaonevision1_5Config,
- attn_implementation: str = 'eager',
)
Bases: torch.nn.Module
Combined vision + language backbone. Returns last_hidden_state.
Initialization
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_image_features(
- pixel_values: torch.FloatTensor,
- image_grid_thw: torch.LongTensor,
)
- forward(
- input_ids: Optional[torch.LongTensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- pixel_values: Optional[torch.FloatTensor] = None,
- pixel_values_videos: Optional[torch.FloatTensor] = None,
- image_grid_thw: Optional[torch.LongTensor] = None,
- video_grid_thw: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- **kwargs,
)
- class nemo_automodel.components.models.llava_onevision.model.LLaVAOneVision1_5_ForConditionalGeneration(
- config,
- attn_implementation: Optional[str] = None,
- **kwargs,
)
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module
LLaVA-OneVision-1.5 for conditional generation (Rice ViT + Qwen3 text).
Initialization
- config_class#
None
- classmethod from_config(config, **kwargs)#
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
)
- property dtype#
- property visual#
- property language_model#
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_output_embeddings()#
- set_output_embeddings(new_embeddings)#
- forward(
- input_ids: Optional[torch.LongTensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- past_key_values: Optional[List[torch.FloatTensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = None,
- pixel_values: Optional[torch.FloatTensor] = None,
- pixel_values_videos: Optional[torch.FloatTensor] = None,
- image_grid_thw: Optional[torch.LongTensor] = None,
- video_grid_thw: Optional[torch.LongTensor] = None,
- **kwargs,
)