nemo_automodel.components.models.llama_nemotron_vl.model
Module Contents
Classes
Functions
Data
_HAS_NATIVE_BIDIRECTIONAL_MASK
API
Bases: LlamaConfig
Configuration for bidirectional (non-causal) LLaMA model.
Bases: LlamaModel
LlamaModel modified to use bidirectional (non-causal) attention. Supports transformers 4.44+ through 5.x with a unified forward() implementation. See https://huggingface.co/nvidia/llama-nemotron-embed-1b-v2 for version notes.
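The difference between causal and bidirectional attention can be illustrated with a small mask builder. This is a minimal sketch rather than the module's implementation: build_attention_mask is a hypothetical helper, and plain Python lists stand in for tensors.

```python
def build_attention_mask(padding_mask, causal):
    """Return a square boolean matrix; entry [i][j] is True when
    query position i may attend to key position j.

    padding_mask: list of 0/1 flags, 1 for real tokens, 0 for padding.
    causal: if True, position i only sees positions j <= i.
    (Hypothetical helper for illustration only.)
    """
    seq_len = len(padding_mask)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = padding_mask[j] == 1  # padding keys are never visible
            if causal:
                visible = visible and j <= i  # hide future positions
            row.append(visible)
        mask.append(row)
    return mask


padding = [1, 1, 1, 0]  # three real tokens followed by one padding token
causal_mask = build_attention_mask(padding, causal=True)
bidirectional_mask = build_attention_mask(padding, causal=False)
```

With the causal mask, the first token sees only itself. With the bidirectional mask, every token sees all non-padding tokens, which is the behavior an embedding or reranking model generally wants.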
Bases: PretrainedConfig
Base configuration for vision-language models combining vision and language components. This serves as the foundation for LlamaNemotronVL configurations.
Bases: PreTrainedModel
LlamaNemotron VL model for vision-language reranking. Combines a vision encoder (SigLIP) with a bidirectional language model (LLaMA) for cross-modal reranking tasks.
Encodes the inputs into a tensor of embeddings.
Args:
    inputs: A dictionary of inputs to the model. You can prepare the inputs using the processor.process_queries and processor.process_documents methods.
    pool_type: The type of pooling to use. If None, the pooling type configured in the model is used.
Returns:
    A tensor of embeddings.
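The pool_type fallback described above can be sketched as follows. The function name pool, the pooling names "avg" and "last", and the default_pool_type argument are assumptions made for illustration; the source does not list the pooling types the model actually supports.

```python
def pool(hidden_states, attention_mask, pool_type=None, default_pool_type="avg"):
    """Reduce per-token vectors to a single embedding.

    hidden_states: list of per-token vectors (lists of floats).
    attention_mask: list of 0/1 flags, 1 for real tokens.
    pool_type: hypothetical names "avg" or "last"; None falls back to
        default_pool_type, mirroring "use the configured pooling type".
    """
    if pool_type is None:
        pool_type = default_pool_type
    # Ignore padding positions before pooling.
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    if pool_type == "avg":
        dim = len(kept[0])
        return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]
    if pool_type == "last":
        return list(kept[-1])
    raise ValueError(f"unknown pool_type: {pool_type!r}")


hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]  # the third position is padding and must not affect the result
avg_embedding = pool(hidden, mask)                     # falls back to "avg"
last_embedding = pool(hidden, mask, pool_type="last")  # explicit override
```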
Encodes the input document images and texts into a tensor of embeddings.
Args:
    images: A list of PIL.Image objects, one per document page.
    texts: A list of document page texts.
Returns:
    A tensor of embeddings.
Encodes the input queries into a tensor of embeddings.
Args:
    queries: A list of queries.
Returns:
    A tensor of embeddings.
Extract and project vision features to language model space.
Keep only vision embeddings marked as real images.
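One way to picture this filtering step, assuming each vision embedding comes with a per-image flag that marks it as real rather than a placeholder. The name keep_real_images and the flag representation are hypothetical, not taken from the module.

```python
def keep_real_images(vision_embeds, is_real_image):
    """Drop embeddings whose flag is falsy.

    vision_embeds: list of per-image embeddings.
    is_real_image: list of booleans, same length as vision_embeds.
    (Hypothetical helper; the flag layout is an assumption.)
    """
    if len(vision_embeds) != len(is_real_image):
        raise ValueError("expected one flag per vision embedding")
    return [emb for emb, flag in zip(vision_embeds, is_real_image) if flag]


embeds = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
flags = [True, False, True]  # the middle entry is a dummy image
real_embeds = keep_real_images(embeds, flags)
```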
Register bidirectional models with HuggingFace Auto classes.
This is needed so that AutoModel.from_config(LlamaBidirectionalConfig) works inside LlamaForSequenceClassification.__init__.
Replace image placeholder token embeddings with vision embeddings.
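The replacement can be sketched as a position-wise substitution. This is an illustrative sketch only: merge_vision_embeddings and the image_token_id value of 99 are hypothetical, and lists stand in for tensors.

```python
def merge_vision_embeddings(input_ids, token_embeds, vision_embeds, image_token_id):
    """Return a copy of token_embeds in which every position holding
    image_token_id is overwritten, in order, by the next vision embedding.

    (Hypothetical helper; names and shapes are assumptions.)
    """
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if len(positions) != len(vision_embeds):
        raise ValueError(
            f"{len(positions)} placeholder tokens but {len(vision_embeds)} vision embeddings"
        )
    merged = [list(vec) for vec in token_embeds]  # copy; leave the input untouched
    for pos, vec in zip(positions, vision_embeds):
        merged[pos] = list(vec)
    return merged


IMAGE_TOKEN_ID = 99  # hypothetical placeholder id
input_ids = [5, IMAGE_TOKEN_ID, IMAGE_TOKEN_ID, 7]
token_embeds = [[0.1], [0.0], [0.0], [0.2]]
vision_embeds = [[1.0], [2.0]]
merged = merge_vision_embeddings(input_ids, token_embeds, vision_embeds, IMAGE_TOKEN_ID)
```

The count check catches a mismatch between the number of placeholder tokens and the number of vision embeddings early, instead of leaving some placeholders silently unreplaced.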