nemo_export.model_adapters.embedding.embedding_adapter#
Module Contents#
Classes#

- LlamaBidirectionalHFAdapter: Wraps a text embedding model with pooling and normalization for bidirectional encoding.
- Pooling: Pooling layer that aggregates token-level embeddings into sequence-level embeddings.

Functions#

- get_llama_bidirectional_hf_model: Factory function to create a LlamaBidirectionalHFAdapter with proper configuration.
API#
- class nemo_export.model_adapters.embedding.embedding_adapter.LlamaBidirectionalHFAdapter(model: torch.nn.Module, normalize: bool, pooling_module: torch.nn.Module)#

Bases: torch.nn.Module
Wraps a text embedding model with pooling and normalization for bidirectional encoding.
This adapter combines a transformer model with configurable pooling strategies and optional L2 normalization to produce fixed-size embeddings from variable-length text sequences. It supports dimension reduction and various pooling methods including average, CLS token, and last token pooling.
- Parameters:
model – The underlying transformer model (e.g., AutoModel from HuggingFace).
normalize – Whether to apply L2 normalization to the output embeddings.
pooling_module – The pooling module to use for aggregating token embeddings.
Initialization
Initialize the LlamaBidirectionalHFAdapter.
- Parameters:
model – The transformer model to wrap.
normalize – If True, applies L2 normalization to output embeddings.
pooling_module – Module that handles pooling of token embeddings.
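In most cases the factory function get_llama_bidirectional_hf_model documented below handles this wiring for you. Purely as a minimal sketch, assuming a generic HuggingFace encoder checkpoint (the model name below is only an example, not a requirement of the adapter), the adapter can also be constructed directly:

```python
from transformers import AutoModel

from nemo_export.model_adapters.embedding.embedding_adapter import (
    LlamaBidirectionalHFAdapter,
    Pooling,
)

# Example checkpoint name (an assumption for illustration only).
encoder = AutoModel.from_pretrained("intfloat/e5-small-v2")

adapter = LlamaBidirectionalHFAdapter(
    model=encoder,
    normalize=True,                              # L2-normalize the pooled embeddings
    pooling_module=Pooling(pooling_mode="avg"),  # average over non-padded tokens
)
```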
- property device: torch.device#
Returns the device of the underlying model.
- Returns:
The device where the model parameters are located.
- Return type:
torch.device
- forward(input_ids: torch.Tensor, attention_mask: torch.Tensor, token_type_ids: Optional[torch.Tensor] = None, dimensions: Optional[torch.Tensor] = None)#
Forward pass through the adapted model to generate embeddings.
- Parameters:
input_ids – Token IDs of shape (batch_size, sequence_length).
attention_mask – Attention mask of shape (batch_size, sequence_length).
token_type_ids – Optional token type IDs for models that use them.
dimensions – Optional tensor specifying the desired output dimensions for each sample in the batch. If provided, embeddings will be truncated/masked to these dimensions.
- Returns:
Pooled and optionally normalized embeddings of shape (batch_size, embedding_dim) or (batch_size, max_dimensions) if dimensions parameter is used.
- Return type:
torch.Tensor
- Raises:
ValueError – If dimensions contain non-positive values.
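A minimal sketch of a forward call (via __call__), assuming an adapter and a matching tokenizer are already available, for example from get_llama_bidirectional_hf_model documented below; the input strings and the dimensions values are illustrative assumptions:

```python
import torch

# Assumes `adapter` (a LlamaBidirectionalHFAdapter) and `tokenizer` already exist.
batch = tokenizer(
    ["how do transformers work?", "attention is all you need"],
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    embeddings = adapter(
        input_ids=batch["input_ids"].to(adapter.device),
        attention_mask=batch["attention_mask"].to(adapter.device),
        # Optional: per-sample target sizes; the output is truncated/masked accordingly.
        dimensions=torch.tensor([256, 256]).to(adapter.device),
    )

# Shape is (batch_size, 256) here because `dimensions` was given; otherwise it
# would be (batch_size, embedding_dim).
print(embeddings.shape)
```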
- class nemo_export.model_adapters.embedding.embedding_adapter.Pooling(pooling_mode: str)#
Bases: torch.nn.Module
Pooling layer that aggregates token-level embeddings into sequence-level embeddings.
Supports multiple pooling strategies:
- 'avg': Average pooling over non-padded tokens
- 'cls': Uses the first token (CLS token) with right padding
- 'cls__left': Uses the first non-padded token with left padding
- 'last': Uses the last token with left padding
- 'last__right': Uses the last non-padded token with right padding
- Parameters:
pooling_mode – The pooling strategy to use.
Initialization
Initialize the Pooling layer.
- Parameters:
pooling_mode – The pooling strategy. Must be one of: ‘avg’, ‘cls’, ‘cls__left’, ‘last’, ‘last__right’.
- forward(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor)#
Apply pooling to the hidden states.
- Parameters:
last_hidden_states – Hidden states from the transformer model of shape (batch_size, sequence_length, hidden_size).
attention_mask – Attention mask of shape (batch_size, sequence_length) where 1 indicates real tokens and 0 indicates padding.
- Returns:
Pooled embeddings of shape (batch_size, hidden_size).
- Return type:
torch.Tensor
- Raises:
ValueError – If the pooling_mode is not supported.
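As an illustrative sketch with toy tensors (the shapes and values below are assumptions), the documented 'avg' strategy averages over non-padded tokens, which corresponds to a mask-weighted mean:

```python
import torch

from nemo_export.model_adapters.embedding.embedding_adapter import Pooling

# Toy inputs: batch of 2 sequences, 4 tokens each, hidden size 8.
# The second sequence has one padded position (attention_mask == 0).
last_hidden_states = torch.randn(2, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 1, 0]])

pooling = Pooling(pooling_mode="avg")
pooled = pooling(last_hidden_states, attention_mask)
print(pooled.shape)  # torch.Size([2, 8])

# For comparison: 'avg' pooling should agree with a masked mean over real tokens.
mask = attention_mask.unsqueeze(-1).float()
manual_avg = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
```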
- nemo_export.model_adapters.embedding.embedding_adapter.get_llama_bidirectional_hf_model(model_name_or_path: Union[str, os.PathLike[str]], normalize: bool, pooling_mode: Optional[Literal["avg", "cls", "last"]] = None, torch_dtype: Optional[Union[torch.dtype, str]] = None, trust_remote_code: bool = False)#
Factory function to create a LlamaBidirectionalHFAdapter with proper configuration.
This function loads a HuggingFace transformer model and tokenizer, configures the appropriate pooling strategy based on the tokenizer’s padding side, and wraps everything in a LlamaBidirectionalHFAdapter.
Special handling is provided for NVEmbedModel which has separate embedding and latent attention components.
- Parameters:
model_name_or_path – Path to the model directory or HuggingFace model identifier.
normalize – Whether to apply L2 normalization to the output embeddings.
pooling_mode –
The pooling strategy to use. If None, defaults to "avg". The mode is automatically adjusted based on the tokenizer's padding side:
- "last" becomes "last__right" for right-padding tokenizers
- "cls" becomes "cls__left" for left-padding tokenizers
torch_dtype – The torch data type to use for the model. If None, uses model default.
trust_remote_code – Whether to trust remote code when loading the model.
- Returns:
A tuple containing:
- LlamaBidirectionalHFAdapter: The configured adapter model
- AutoTokenizer: The tokenizer for the model
- Return type:
tuple
Example

```python
model, tokenizer = get_llama_bidirectional_hf_model(
    "sentence-transformers/all-MiniLM-L6-v2",
    normalize=True,
    pooling_mode="avg",
)
# Use model and tokenizer for embedding generation
```
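Continuing the example, a hedged sketch of the embedding-generation step; the sentences and the cosine-similarity follow-up are illustrative assumptions, not part of the API:

```python
import torch

sentences = ["A quick brown fox.", "Text embeddings map sentences to vectors."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(
        input_ids=batch["input_ids"].to(model.device),
        attention_mask=batch["attention_mask"].to(model.device),
    )

# With normalize=True the embeddings are unit length, so a dot product
# between rows gives cosine similarity.
similarity = embeddings @ embeddings.T
```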