nemo_automodel.components.models.llama_nemotron_vl.model

Module Contents

Classes

| Name | Description |
| --- | --- |
| LlamaBidirectionalConfig | Configuration for bidirectional (non-causal) LLaMA model. |
| LlamaBidirectionalModel | LlamaModel modified to use bidirectional (non-causal) attention. |
| LlamaNemotronVLConfig | Base configuration for vision-language models combining vision and language components. |
| LlamaNemotronVLModel | LlamaNemotron VL model for vision-language reranking. |

Functions

| Name | Description |
| --- | --- |
| _filter_vision_embeddings_by_image_flags | Keep only vision embeddings marked as real images. |
| _register_with_hf_auto_classes | Register bidirectional models with HuggingFace Auto classes. |
| _replace_image_token_embeddings | Replace image placeholder token embeddings with vision embeddings. |
| pool | - |
| split_model | - |

Data

ModelClass

_DYNAMIC_CACHE_ACCEPTS_CONFIG

_HAS_NATIVE_BIDIRECTIONAL_MASK

_USE_PLURAL_CACHE_PARAM

__all__

_decoder_forward_params

_dynamic_cache_init_params

logger

API

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalConfig(
pooling = 'avg',
temperature = 1.0,
kwargs = {}
)

Bases: LlamaConfig

Configuration for bidirectional (non-causal) LLaMA model.

model_type = 'llama_bidirec'

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalModel(
config: nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalConfig
)

Bases: LlamaModel

LlamaModel modified to use bidirectional (non-causal) attention. Supports transformers 4.44+ through 5.x with a unified forward() implementation. See https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2 for version notes.
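To make "bidirectional (non-causal) attention" concrete, the following is a minimal sketch of how a padding mask can be expanded into an additive attention mask without a causal triangle. The helper name `make_bidirectional_mask` is hypothetical, and the library's own `_create_bidirectional_mask` may differ in shape handling, dtype, or backend-specific details.

```python
import torch


def make_bidirectional_mask(
    attention_mask: torch.Tensor, dtype: torch.dtype = torch.float32
) -> torch.Tensor:
    """Expand a (batch, seq) padding mask into an additive (batch, 1, seq, seq) mask.

    Hypothetical helper for illustration only; not the library implementation.
    """
    batch, seq_len = attention_mask.shape
    # Every query position may attend to every non-padding key position,
    # including keys that come later in the sequence (no causal triangle).
    keep = attention_mask[:, None, None, :].to(torch.bool)
    keep = keep.expand(batch, 1, seq_len, seq_len)
    mask = torch.zeros(batch, 1, seq_len, seq_len, dtype=dtype)
    # Padding keys receive a very large negative bias so softmax ignores them.
    return mask.masked_fill(~keep, torch.finfo(dtype).min)


padding_mask = torch.tensor([[1, 1, 1, 0]])  # last position is padding
bidirectional_mask = make_bidirectional_mask(padding_mask)
```

In this sketch the first token is allowed to attend to the third token, which a causal mask would forbid, while the padded fourth position stays masked for every query.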

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalModel._create_bidirectional_mask(
input_embeds: torch.Tensor,
attention_mask: torch.Tensor | None
) -> torch.Tensor | None

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaBidirectionalModel.forward(
input_ids: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.LongTensor | None = None,
past_key_values: transformers.cache_utils.Cache | None = None,
inputs_embeds: torch.FloatTensor | None = None,
cache_position: torch.LongTensor | None = None,
use_cache: bool | None = None,
output_hidden_states: bool | None = None,
kwargs = {}
) -> transformers.modeling_outputs.BaseModelOutputWithPast

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLConfig(
vision_config = None,
llm_config = None,
use_backbone_lora = 0,
use_llm_lora = 0,
select_layer = -1,
force_image_size = None,
downsample_ratio = 0.5,
template = None,
dynamic_image_size = False,
use_thumbnail = False,
min_dynamic_patch = 1,
max_dynamic_patch = 6,
mlp_checkpoint = True,
pre_feature_reduction = False,
keep_aspect_ratio = False,
vocab_size = -1,
q_max_length: typing.Optional[int] = 512,
p_max_length: typing.Optional[int] = 10240,
query_prefix: str = 'query:',
passage_prefix: str = 'passage:',
pooling: str = 'last',
bidirectional_attention: bool = False,
max_input_tiles: int = 2,
img_context_token_id: int = 128258,
kwargs = {}
)

Bases: PretrainedConfig

Base configuration for vision-language models combining vision and language components. This serves as the foundation for LlamaNemotronVL configurations.

llm_config = LlamaBidirectionalConfig(**llm_config)
model_type = 'llama_nemotron_vl'
sub_configs
vision_config = SiglipVisionConfig(**vision_config)
vocab_size = self.llm_config.vocab_size

class nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel(
config: nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLConfig,
vision_model: typing.Optional[transformers.PreTrainedModel] = None,
language_model: typing.Optional[transformers.PreTrainedModel] = None
)

Bases: PreTrainedModel

LlamaNemotron VL model for vision-language reranking. Combines a vision encoder (SigLIP) with a bidirectional language model (LLaMA) for cross-modal reranking tasks.

_no_split_modules = ['LlamaDecoderLayer']
downsample_ratio = config.downsample_ratio
main_input_name = 'pixel_values'
mlp1
num_image_token = int((grid_size * config.downsample_ratio) ** 2)
patch_size = 14
processor
select_layer = config.select_layer
template = config.template
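
The `num_image_token` expression above can be worked through with concrete numbers. The sketch below assumes that `grid_size` is the number of patches along one side of a square image tile (image size divided by `patch_size`); that definition and the 448-pixel tile size are assumptions for illustration, not values stated on this page.

```python
def num_image_tokens(
    image_size: int, patch_size: int = 14, downsample_ratio: float = 0.5
) -> int:
    """Evaluate int((grid_size * downsample_ratio) ** 2) for one square tile."""
    # Assumption: grid_size counts patches along one side of the tile.
    grid_size = image_size // patch_size
    return int((grid_size * downsample_ratio) ** 2)


# Hypothetical 448x448 tile: 448 // 14 = 32 patches per side,
# 32 * 0.5 = 16 per side after downsampling, 16 ** 2 = 256 tokens.
tokens_per_tile = num_image_tokens(448)
```

Under these assumptions, halving the grid along each side reduces the number of image tokens per tile by a factor of four.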
nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel._embed_batch(
inputs: typing.Dict[str, typing.Any],
pool_type: typing.Optional[str] = None
)

Encodes the inputs into a tensor of embeddings.

Args:
    inputs: A dictionary of inputs to the model. You can prepare the inputs using the processor.process_queries and processor.process_documents methods.
    pool_type: The type of pooling to use. If None, the pooling type configured in the model is used.

Returns:
    A tensor of embeddings.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.build_collator(
processor = None,
kwargs = {}
)

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.encode_documents(
images: typing.Optional[typing.List[typing.Any]] = None,
texts: typing.Optional[typing.List[str]] = None,
kwargs = {}
)

Encodes the input document images and texts into a tensor of embeddings.

Args:
    images: A list of PIL.Image objects, one per document page.
    texts: A list of document page texts.

Returns:
    A tensor of embeddings.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.encode_queries(
queries: typing.List[str],
kwargs = {}
)

Encodes the input queries into a tensor of embeddings.

Args:
    queries: A list of queries.

Returns:
    A tensor of embeddings.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.extract_feature(
pixel_values
)

Extract and project vision features to language model space.

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.forward(
pixel_values: torch.FloatTensor = None,
input_ids: torch.LongTensor = None,
attention_mask: typing.Optional[torch.Tensor] = None,
position_ids: typing.Optional[torch.LongTensor] = None,
image_flags: typing.Optional[torch.LongTensor] = None,
past_key_values: typing.Optional[typing.List[torch.FloatTensor]] = None,
labels: typing.Optional[torch.LongTensor] = None,
use_cache: typing.Optional[bool] = None,
output_attentions: typing.Optional[bool] = None,
output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None,
num_patches_list: typing.Optional[typing.List[torch.Tensor]] = None,
run_dummy_vision: typing.Optional[bool] = None
) -> typing.Union[typing.Tuple, transformers.modeling_outputs.CausalLMOutputWithPast]

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.get_input_embeddings()

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.get_output_embeddings()

nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.pixel_shuffle(
x,
scale_factor = 0.5
)
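
This page gives no description for `pixel_shuffle`. As a rough illustration only, the sketch below shows a space-to-depth rearrangement of the kind commonly used in vision-language models to trade spatial resolution for channel depth. It assumes a channels-last input of shape `(batch, width, height, channels)` and is not confirmed to match the library's implementation; the function name `pixel_shuffle_sketch` is hypothetical.

```python
import torch


def pixel_shuffle_sketch(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    """Shrink the spatial grid by scale_factor per side and grow channels to match.

    Assumes x has shape (batch, width, height, channels); illustrative only.
    """
    n, w, h, c = x.size()
    # Fold neighbouring positions along the height axis into the channel axis.
    x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    # Fold neighbouring positions along the width axis into the channel axis.
    x = x.view(
        n,
        int(h * scale_factor),
        int(w * scale_factor),
        int(c / (scale_factor * scale_factor)),
    )
    return x.permute(0, 2, 1, 3).contiguous()


features = torch.randn(2, 32, 32, 8)
shuffled = pixel_shuffle_sketch(features)  # 16 x 16 grid with 32 channels
```

With `scale_factor = 0.5` the grid shrinks from 32 x 32 to 16 x 16 while the channel count grows fourfold, so no values are discarded.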
nemo_automodel.components.models.llama_nemotron_vl.model.LlamaNemotronVLModel.post_loss(
loss,
inputs
)

nemo_automodel.components.models.llama_nemotron_vl.model._filter_vision_embeddings_by_image_flags(
vit_embeds: torch.Tensor,
image_flags: typing.Optional[torch.Tensor]
) -> torch.Tensor

Keep only vision embeddings marked as real images.
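
A minimal sketch of this kind of filtering follows. It assumes that `image_flags` holds one entry per image tile with 1 marking a real image, and that a missing `image_flags` means nothing is filtered; both points are assumptions, and the helper name `filter_by_image_flags` is hypothetical.

```python
from typing import Optional

import torch


def filter_by_image_flags(
    vit_embeds: torch.Tensor, image_flags: Optional[torch.Tensor]
) -> torch.Tensor:
    """Keep rows of vit_embeds whose flag equals 1 (illustrative sketch)."""
    if image_flags is None:
        # Assumption: without flags, every embedding is treated as real.
        return vit_embeds
    keep = image_flags.reshape(-1) == 1
    return vit_embeds[keep]


vit_embeds = torch.arange(3 * 2 * 4, dtype=torch.float32).reshape(3, 2, 4)
image_flags = torch.tensor([1, 0, 1])  # the middle tile is a dummy
kept = filter_by_image_flags(vit_embeds, image_flags)
```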

nemo_automodel.components.models.llama_nemotron_vl.model._register_with_hf_auto_classes()

Register bidirectional models with HuggingFace Auto classes.

This is needed so that AutoModel.from_config(LlamaBidirectionalConfig) works inside LlamaForSequenceClassification.__init__.

nemo_automodel.components.models.llama_nemotron_vl.model._replace_image_token_embeddings(
input_embeds: torch.Tensor,
input_ids: torch.Tensor,
vit_embeds: torch.Tensor,
img_context_token_id: int
) -> torch.Tensor

Replace image placeholder token embeddings with vision embeddings.
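
The replacement step can be sketched as follows. This is an illustrative version under the assumption that the number of placeholder tokens in `input_ids` equals the number of vision embedding vectors; the helper name `replace_image_tokens` is hypothetical and the library function may handle mismatches differently.

```python
import torch


def replace_image_tokens(
    input_embeds: torch.Tensor,
    input_ids: torch.Tensor,
    vit_embeds: torch.Tensor,
    img_context_token_id: int,
) -> torch.Tensor:
    """Overwrite placeholder-token embeddings with vision embeddings (sketch)."""
    batch, seq_len, hidden = input_embeds.shape
    # Work on a flattened copy so the caller's tensor is left untouched.
    flat_embeds = input_embeds.reshape(batch * seq_len, hidden).clone()
    selected = input_ids.reshape(batch * seq_len) == img_context_token_id
    # Assumption: selected.sum() equals the number of vision vectors.
    flat_embeds[selected] = vit_embeds.reshape(-1, hidden).to(flat_embeds.dtype)
    return flat_embeds.reshape(batch, seq_len, hidden)


IMG = 128258  # default img_context_token_id from LlamaNemotronVLConfig
input_ids = torch.tensor([[5, IMG, IMG, 7]])
input_embeds = torch.zeros(1, 4, 3)
vit_embeds = torch.ones(1, 2, 3)
merged = replace_image_tokens(input_embeds, input_ids, vit_embeds, IMG)
```

In this toy input only positions 1 and 2 carry the placeholder id, so only those two rows are overwritten with vision embeddings.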

nemo_automodel.components.models.llama_nemotron_vl.model.pool(
last_hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
pool_type: str
) -> torch.Tensor
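
This page gives no description for `pool`. The sketch below shows what the `'avg'` and `'last'` pooling types used as config defaults on this page typically mean. It assumes right-padded sequences for `'last'`; the library may support other pool types or padding sides, and `pool_sketch` is a hypothetical name.

```python
import torch


def pool_sketch(
    last_hidden_states: torch.Tensor, attention_mask: torch.Tensor, pool_type: str
) -> torch.Tensor:
    """Reduce (batch, seq, hidden) states to (batch, hidden) embeddings (sketch)."""
    if pool_type == "avg":
        mask = attention_mask.to(last_hidden_states.dtype)
        # Sum only non-padding positions, then divide by their count.
        summed = (last_hidden_states * mask[..., None]).sum(dim=1)
        return summed / mask.sum(dim=1, keepdim=True).clamp(min=1)
    if pool_type == "last":
        # Assumption: right padding, so the last real token sits at count - 1.
        last_index = attention_mask.sum(dim=1) - 1
        rows = torch.arange(last_hidden_states.shape[0])
        return last_hidden_states[rows, last_index]
    raise ValueError(f"Unsupported pool_type in this sketch: {pool_type}")


hidden = torch.tensor([[[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])  # third position is padding
avg_embedding = pool_sketch(hidden, mask, "avg")
last_embedding = pool_sketch(hidden, mask, "last")
```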
nemo_automodel.components.models.llama_nemotron_vl.model.split_model(
model_path,
device
)

nemo_automodel.components.models.llama_nemotron_vl.model.ModelClass = [LlamaNemotronVLModel]
nemo_automodel.components.models.llama_nemotron_vl.model._DYNAMIC_CACHE_ACCEPTS_CONFIG = 'config' in _dynamic_cache_init_params
nemo_automodel.components.models.llama_nemotron_vl.model._HAS_NATIVE_BIDIRECTIONAL_MASK = True
nemo_automodel.components.models.llama_nemotron_vl.model._USE_PLURAL_CACHE_PARAM = 'past_key_values' in _decoder_forward_params
nemo_automodel.components.models.llama_nemotron_vl.model.__all__ = ['LlamaNemotronVLModel', 'LlamaNemotronVLConfig', 'ModelClass']
nemo_automodel.components.models.llama_nemotron_vl.model._decoder_forward_params = inspect.signature(LlamaDecoderLayer.forward).parameters
nemo_automodel.components.models.llama_nemotron_vl.model._dynamic_cache_init_params = inspect.signature(DynamicCache.__init__).parameters
nemo_automodel.components.models.llama_nemotron_vl.model.logger = logging.get_logger(__name__)
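
The flags `_DYNAMIC_CACHE_ACCEPTS_CONFIG` and `_USE_PLURAL_CACHE_PARAM` above are computed by inspecting function signatures rather than by comparing version strings. The sketch below illustrates that pattern with two hypothetical stand-in classes, `OldCache` and `NewCache`; it does not touch the real `DynamicCache` or `LlamaDecoderLayer`.

```python
import inspect


class OldCache:
    """Hypothetical stand-in for an older API whose constructor takes no config."""

    def __init__(self):
        self.config = None


class NewCache:
    """Hypothetical stand-in for a newer API whose constructor accepts a config."""

    def __init__(self, config=None):
        self.config = config


def accepts_param(func, name: str) -> bool:
    """Return True if the callable declares a parameter with the given name."""
    return name in inspect.signature(func).parameters


def make_cache(cache_cls, config):
    """Pass config only when the constructor declares it."""
    if accepts_param(cache_cls.__init__, "config"):
        return cache_cls(config=config)
    return cache_cls()


new_cache = make_cache(NewCache, {"num_layers": 2})
old_cache = make_cache(OldCache, {"num_layers": 2})
```

Checking the signature once at import time keeps version-specific branching out of the hot path, which fits the module's stated goal of supporting a wide range of transformers releases with one forward() implementation.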