nemo_automodel.components.models.llama_nemotron_vl.model

Module Contents

Classes

Name	Description
`LlamaBidirectionalConfig`	Configuration for bidirectional (non-causal) LLaMA model.
`LlamaBidirectionalModel`	LlamaModel modified to use bidirectional (non-causal) attention.
`LlamaNemotronVLConfig`	Base configuration for vision-language models combining vision and language components.
`LlamaNemotronVLModel`	LlamaNemotron VL model for vision-language reranking.

Functions

Name	Description
`_filter_vision_embeddings_by_image_flags`	Keep only vision embeddings marked as real images.
`_register_with_hf_auto_classes`	Register bidirectional models with HuggingFace Auto classes.
`_replace_image_token_embeddings`	Replace image placeholder token embeddings with vision embeddings.
`pool`	-
`split_model`	-

Data

ModelClass

_DYNAMIC_CACHE_ACCEPTS_CONFIG

_HAS_NATIVE_BIDIRECTIONAL_MASK

_USE_PLURAL_CACHE_PARAM

__all__

_decoder_forward_params

_dynamic_cache_init_params

logger

API

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalConfig(
    pooling = 'avg',
    temperature = 1.0,
    **kwargs
)

Bases: LlamaConfig

Configuration for bidirectional (non-causal) LLaMA model.

model_type

= 'llama_bidirec'

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalModel(
    config: nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalConfig
)

Bases: LlamaModel

LlamaModel modified to use bidirectional (non-causal) attention. Supports transformers 4.44+ through 5.x with a unified forward() implementation. See https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2 for version notes.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalModel._create_bidirectional_mask(
    input_embeds: torch.Tensor,
    attention_mask: torch.Tensor | None
) -> torch.Tensor | None
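
The reference gives no description for this method. As a rough illustration of what "bidirectional (non-causal)" means for an attention mask, the sketch below uses plain Python lists rather than tensors. The helper names and the list-based layout are assumptions made for this example and are not the module's implementation; returning None when no padding mask is given is also an assumption, suggested only by the optional return type above.

```python
from typing import List, Optional

Mask3D = List[List[List[bool]]]


def bidirectional_mask(attention_mask: Optional[List[List[int]]]) -> Optional[Mask3D]:
    """allowed[b][i][j] is True when query position i may attend to key position j.

    Only padding is masked; there is no "j <= i" restriction.
    """
    if attention_mask is None:
        # Assumption: with no padding information there is nothing to mask.
        return None
    return [[[bool(key) for key in row] for _ in row] for row in attention_mask]


def causal_mask(attention_mask: List[List[int]]) -> Mask3D:
    """Same layout, but each query position only sees itself and earlier keys."""
    return [
        [[bool(key) and j <= i for j, key in enumerate(row)] for i in range(len(row))]
        for row in attention_mask
    ]


# One sequence of three positions; the last position is padding.
padding = [[1, 1, 0]]
bidir = bidirectional_mask(padding)
causal = causal_mask(padding)
```

In the bidirectional case the first token can attend to the second one, which the causal variant forbids; both variants ignore the padded position.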

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalModel.forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: transformers.cache_utils.Cache | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    cache_position: torch.LongTensor | None = None,
    use_cache: bool | None = None,
    output_hidden_states: bool | None = None,
    **kwargs
) -> transformers.modeling_outputs.BaseModelOutputWithPast

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLConfig(
    vision_config = None,
    llm_config = None,
    use_backbone_lora = 0,
    use_llm_lora = 0,
    select_layer = -1,
    force_image_size = None,
    downsample_ratio = 0.5,
    template = None,
    dynamic_image_size = False,
    use_thumbnail = False,
    min_dynamic_patch = 1,
    max_dynamic_patch = 6,
    mlp_checkpoint = True,
    pre_feature_reduction = False,
    keep_aspect_ratio = False,
    vocab_size = -1,
    q_max_length: typing.Optional[int] = 512,
    p_max_length: typing.Optional[int] = 10240,
    query_prefix: str = 'query:',
    passage_prefix: str = 'passage:',
    pooling: str = 'last',
    bidirectional_attention: bool = False,
    max_input_tiles: int = 2,
    img_context_token_id: int = 128258,
    **kwargs
)

Bases: PretrainedConfig

Base configuration for vision-language models combining vision and language components. This serves as the foundation for LlamaNemotronVL configurations.

llm_config

= LlamaBidirectionalConfig(**llm_config)

model_type

= 'llama_nemotron_vl'

sub_configs

vision_config

= SiglipVisionConfig(**vision_config)

vocab_size

= self.llm_config.vocab_size

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel(
    config: nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLConfig,
    vision_model: typing.Optional[transformers.PreTrainedModel] = None,
    language_model: typing.Optional[transformers.PreTrainedModel] = None
)

Bases: PreTrainedModel

LlamaNemotron VL model for vision-language reranking. Combines a vision encoder (SigLIP) with a bidirectional language model (LLaMA) for cross-modal reranking tasks.

_no_split_modules

= ['LlamaDecoderLayer']

downsample_ratio

= config.downsample_ratio

main_input_name

= 'pixel_values'

mlp1

num_image_token

= int((grid_size * config.downsample_ratio) ** 2)

patch_size

= 14
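
The num_image_token formula above can be checked with a small worked example. This is a sketch under stated assumptions: the reference does not define grid_size, so it is taken here to be the number of patches per image side (image size divided by patch_size), and the 448 and 224 pixel image sizes are hypothetical values chosen for illustration.

```python
def num_image_tokens(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    # Assumption: grid_size is the number of patches along one side of the image.
    grid_size = image_size // patch_size
    # Formula as listed for num_image_token in this reference.
    return int((grid_size * downsample_ratio) ** 2)


# patch_size = 14 and downsample_ratio = 0.5 are the values shown in this reference;
# the image sizes are hypothetical.
tokens_448 = num_image_tokens(448, 14, 0.5)  # grid 32 -> (32 * 0.5) ** 2 = 256
tokens_224 = num_image_tokens(224, 14, 0.5)  # grid 16 -> (16 * 0.5) ** 2 = 64
```

Halving the grid on each side reduces the number of image tokens by a factor of four.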

processor

select_layer

= config.select_layer

template

= config.template

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel._embed_batch(
    inputs: typing.Dict[str, typing.Any],
    pool_type: typing.Optional[str] = None
)

Encodes the inputs into a tensor of embeddings.

Args:
    inputs: A dictionary of inputs to the model. You can prepare the inputs using the processor.process_queries and processor.process_documents methods.
    pool_type: The type of pooling to use. If None, the pooling type configured in the model is used.

Returns:
    A tensor of embeddings.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.build_collator(
    processor = None,
    **kwargs
)

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.encode_documents(
    images: typing.Optional[typing.List[typing.Any]] = None,
    texts: typing.Optional[typing.List[str]] = None,
    **kwargs
)

Encodes the input document images and texts into a tensor of embeddings.

Args:
    images: A list of PIL.Image objects containing the document page images.
    texts: A list of document page texts.

Returns:
    A tensor of embeddings.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.encode_queries(
    queries: typing.List[str],
    **kwargs
)

Encodes the input queries into a tensor of embeddings.

Args:
    queries: A list of queries.

Returns:
    A tensor of embeddings.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.extract_feature(
    pixel_values
)

Extract and project vision features to language model space.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.forward(
    pixel_values: torch.FloatTensor = None,
    input_ids: torch.LongTensor = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    image_flags: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    num_patches_list: typing.Optional[typing.List[torch.Tensor]] = None,
    run_dummy_vision: typing.Optional[bool] = None
) -> typing.Union[typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.get_input_embeddings()

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.get_output_embeddings()

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.pixel_shuffle(
    x,
    scale_factor = 0.5
)
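
The reference does not describe pixel_shuffle. The sketch below only tracks shapes, assuming the common space-to-depth behaviour in which a scale factor below 1 shrinks the spatial grid and moves the removed positions into the channel dimension. The (batch, width, height, channels) layout, the helper name, and the 1024-channel input are assumptions for illustration, not facts taken from this module.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]


def pixel_shuffle_shape(shape: Shape, scale_factor: float = 0.5) -> Shape:
    """Output shape of an assumed space-to-depth shuffle on (N, W, H, C) features."""
    n, w, h, c = shape
    return (
        n,
        int(w * scale_factor),
        int(h * scale_factor),
        int(c / (scale_factor * scale_factor)),
    )


def numel(shape: Shape) -> int:
    n, w, h, c = shape
    return n * w * h * c


before = (1, 32, 32, 1024)  # hypothetical feature map
after = pixel_shuffle_shape(before, 0.5)  # (1, 16, 16, 4096)
```

Under this assumption the total number of elements is unchanged; only the split between spatial positions and channels moves.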

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.post_loss(
    loss,
    inputs
)

nemo_automodel.components.models.llama_nemotron_vl.model._filter_vision_embeddings_by_image_flags(
    vit_embeds: torch.Tensor,
    image_flags: typing.Optional[torch.Tensor]
) -> torch.Tensor

Keep only vision embeddings marked as real images.
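
A minimal sketch of the filtering idea, using plain lists instead of tensors. It assumes a flag value of 1 marks a real image and that a missing image_flags means "keep everything"; both are assumptions read from the one-line description and the optional parameter type, not confirmed behaviour.

```python
from typing import List, Optional

Embedding = List[float]


def filter_by_image_flags(
    vit_embeds: List[Embedding], image_flags: Optional[List[int]]
) -> List[Embedding]:
    # Assumption: without flags, every embedding is treated as a real image.
    if image_flags is None:
        return vit_embeds
    # Assumption: flag == 1 marks a real image, anything else is a dummy or padding image.
    return [emb for emb, flag in zip(vit_embeds, image_flags) if flag == 1]


embeds = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
kept = filter_by_image_flags(embeds, [1, 0, 1])
```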

nemo_automodel.components.models.llama_nemotron_vl.model._register_with_hf_auto_classes()

Register bidirectional models with HuggingFace Auto classes. This is needed so that AutoModel.from_config(LlamaBidirectionalConfig) works inside LlamaForSequenceClassification.__init__.

nemo_automodel.components.models.llama_nemotron_vl.model._replace_image_token_embeddings(
    input_embeds: torch.Tensor,
    input_ids: torch.Tensor,
    vit_embeds: torch.Tensor,
    img_context_token_id: int
) -> torch.Tensor

Replace image placeholder token embeddings with vision embeddings.
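
The replacement step can be sketched on a single flattened sequence with plain lists. The helper below is an illustration only: it assumes vision embeddings are consumed in order, one per placeholder token, which is an assumption rather than documented behaviour. The placeholder id 128258 is the img_context_token_id default listed in LlamaNemotronVLConfig above.

```python
from typing import List

Embedding = List[float]


def replace_image_token_embeddings(
    input_embeds: List[Embedding],
    input_ids: List[int],
    vit_embeds: List[Embedding],
    img_context_token_id: int,
) -> List[Embedding]:
    vision_iter = iter(vit_embeds)
    merged = []
    for token_id, text_emb in zip(input_ids, input_embeds):
        if token_id == img_context_token_id:
            # Assumption: placeholders are filled left to right, one vision embedding each.
            merged.append(next(vision_iter))
        else:
            merged.append(text_emb)
    return merged


ids = [1, 128258, 128258, 2]
text = [[0.1], [0.0], [0.0], [0.2]]
vision = [[9.0], [8.0]]
merged = replace_image_token_embeddings(text, ids, vision, 128258)
```

Text positions keep their original embeddings; only the placeholder positions change.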

nemo_automodel.components.models.llama_nemotron_vl.model.pool(
    last_hidden_states: torch.Tensor,
    attention_mask: torch.Tensor,
    pool_type: str
) -> torch.Tensor
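
The reference gives no description for pool. The config defaults above mention the pooling names 'avg' and 'last', so the sketch below shows what those two strategies usually mean: a masked mean over real tokens, and the hidden state of the last real token. It uses nested lists instead of tensors, and the name pool_sketch and the error handling are assumptions for this example, not the module's code.

```python
from typing import List

Vector = List[float]


def pool_sketch(
    last_hidden_states: List[List[Vector]],
    attention_mask: List[List[int]],
    pool_type: str,
) -> List[Vector]:
    pooled = []
    for states, mask in zip(last_hidden_states, attention_mask):
        # Keep only positions that the attention mask marks as real tokens.
        kept = [vec for vec, m in zip(states, mask) if m == 1]
        if pool_type == "avg":
            dim = len(kept[0])
            pooled.append([sum(v[d] for v in kept) / len(kept) for d in range(dim)])
        elif pool_type == "last":
            # Last real token, whichever side the padding is on.
            pooled.append(list(kept[-1]))
        else:
            raise ValueError(f"unsupported pool_type: {pool_type!r}")
    return pooled


hidden = [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]]  # one sequence, last position padded
mask = [[1, 1, 0]]
avg = pool_sketch(hidden, mask, "avg")
last = pool_sketch(hidden, mask, "last")
```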

nemo_automodel.components.models.llama_nemotron_vl.model.split_model(
    model_path,
    device
)

nemo_automodel.components.models.llama_nemotron_vl.model.ModelClass = [LlamaNemotronVLModel]

nemo_automodel.components.models.llama_nemotron_vl.model._DYNAMIC_CACHE_ACCEPTS_CONFIG = 'config' in _dynamic_cache_init_params

nemo_automodel.components.models.llama_nemotron_vl.model._HAS_NATIVE_BIDIRECTIONAL_MASK = True

nemo_automodel.components.models.llama_nemotron_vl.model._USE_PLURAL_CACHE_PARAM = 'past_key_values' in _decoder_forward_params

nemo_automodel.components.models.llama_nemotron_vl.model.__all__ = ['LlamaNemotronVLModel', 'LlamaNemotronVLConfig', 'ModelClass']

nemo_automodel.components.models.llama_nemotron_vl.model._decoder_forward_params = inspect.signature(LlamaDecoderLayer.forward).parameters

nemo_automodel.components.models.llama_nemotron_vl.model._dynamic_cache_init_params = inspect.signature(DynamicCache.__init__).parameters
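
The flags above detect library capabilities by inspecting call signatures with inspect.signature instead of comparing version strings. The sketch below shows the same pattern with stdlib only; OldCache, NewCache, accepts_param, and make_cache are hypothetical stand-ins invented for this example and are not part of transformers or this module.

```python
import inspect


class OldCache:
    """Hypothetical cache whose constructor takes no config."""

    def __init__(self):
        self.config = None


class NewCache:
    """Hypothetical cache whose constructor accepts a config."""

    def __init__(self, config=None):
        self.config = config


def accepts_param(func, name: str) -> bool:
    # Same check as the module-level flags: is `name` in the callable's parameters?
    return name in inspect.signature(func).parameters


def make_cache(cache_cls, config):
    # Pass config only when the constructor can take it.
    if accepts_param(cache_cls.__init__, "config"):
        return cache_cls(config=config)
    return cache_cls()


new_cache = make_cache(NewCache, {"num_layers": 2})
old_cache = make_cache(OldCache, {"num_layers": 2})
```

Checking the signature keeps one code path working across library versions whose constructors differ, without hard-coding version numbers.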

nemo_automodel.components.models.llama_nemotron_vl.model.logger = logging.get_logger(__name__)