nemo_export.model_adapters.embedding.embedding_adapter#

Module Contents#

Classes#

LlamaBidirectionalHFAdapter

Wraps a text embedding model with pooling and normalization for bidirectional encoding.

Pooling

Pooling layer that aggregates token-level embeddings into sequence-level embeddings.

Functions#

get_llama_bidirectional_hf_model

Factory function to create a LlamaBidirectionalHFAdapter with proper configuration.

API#

class nemo_export.model_adapters.embedding.embedding_adapter.LlamaBidirectionalHFAdapter(
model: torch.nn.Module,
normalize: bool,
pooling_module: torch.nn.Module,
)#

Bases: torch.nn.Module

Wraps a text embedding model with pooling and normalization for bidirectional encoding.

This adapter combines a transformer model with configurable pooling strategies and optional L2 normalization to produce fixed-size embeddings from variable-length text sequences. It supports dimension reduction and various pooling methods including average, CLS token, and last token pooling.

Parameters:
  • model – The underlying transformer model (e.g., AutoModel from HuggingFace).

  • normalize – Whether to apply L2 normalization to the output embeddings.

  • pooling_module – The pooling module to use for aggregating token embeddings.

Initialization

Initialize the LlamaBidirectionalHFAdapter.

Parameters:
  • model – The transformer model to wrap.

  • normalize – If True, applies L2 normalization to output embeddings.

  • pooling_module – Module that handles pooling of token embeddings.
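For illustration, a minimal construction sketch is shown below; the HuggingFace model name is illustrative, and in practice the get_llama_bidirectional_hf_model factory documented later performs this wiring:

from transformers import AutoModel

from nemo_export.model_adapters.embedding.embedding_adapter import (
    LlamaBidirectionalHFAdapter,
    Pooling,
)

# Any HuggingFace encoder that returns last_hidden_state can be wrapped.
base_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

adapter = LlamaBidirectionalHFAdapter(
    model=base_model,
    normalize=True,                               # L2-normalize pooled embeddings
    pooling_module=Pooling(pooling_mode="avg"),   # average over non-padded tokens
)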

property device: torch.device#

Returns the device of the underlying model.

Returns:

The device where the model parameters are located.

Return type:

torch.device

forward(
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
token_type_ids: Optional[torch.Tensor] = None,
dimensions: Optional[torch.Tensor] = None,
) → torch.Tensor#

Forward pass through the adapted model to generate embeddings.

Parameters:
  • input_ids – Token IDs of shape (batch_size, sequence_length).

  • attention_mask – Attention mask of shape (batch_size, sequence_length).

  • token_type_ids – Optional token type IDs for models that use them.

  • dimensions – Optional tensor specifying the desired output dimensions for each sample in the batch. If provided, embeddings will be truncated/masked to these dimensions.

Returns:

Pooled and optionally normalized embeddings of shape (batch_size, embedding_dim), or (batch_size, max_dimensions) if the dimensions parameter is used.

Return type:

torch.Tensor

Raises:

ValueError – If dimensions contains non-positive values.
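A usage sketch of the forward pass, assuming the adapter from the construction sketch above and a tokenizer that matches the wrapped model (the input sentences and the commented dimensions value are illustrative):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
batch = tokenizer(
    ["What is NeMo Export?", "Bidirectional text embeddings"],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    embeddings = adapter(
        input_ids=batch["input_ids"].to(adapter.device),
        attention_mask=batch["attention_mask"].to(adapter.device),
        # dimensions=torch.tensor([128, 128]),  # optional per-sample truncation
    )

print(embeddings.shape)  # (2, embedding_dim); rows are unit length if normalize=True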

class nemo_export.model_adapters.embedding.embedding_adapter.Pooling(pooling_mode: str)#

Bases: torch.nn.Module

Pooling layer that aggregates token-level embeddings into sequence-level embeddings.

Supports multiple pooling strategies:

  • ‘avg’: Average pooling over non-padded tokens

  • ‘cls’: Uses the first token (CLS token) with right padding

  • ‘cls__left’: Uses the first non-padded token with left padding

  • ‘last’: Uses the last token with left padding

  • ‘last__right’: Uses the last non-padded token with right padding

Parameters:

pooling_mode – The pooling strategy to use.

Initialization

Initialize the Pooling layer.

Parameters:

pooling_mode – The pooling strategy. Must be one of: ‘avg’, ‘cls’, ‘cls__left’, ‘last’, ‘last__right’.

forward(
last_hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
) → torch.Tensor#

Apply pooling to the hidden states.

Parameters:
  • last_hidden_states – Hidden states from the transformer model of shape (batch_size, sequence_length, hidden_size).

  • attention_mask – Attention mask of shape (batch_size, sequence_length) where 1 indicates real tokens and 0 indicates padding.

Returns:

Pooled embeddings of shape (batch_size, hidden_size).

Return type:

torch.Tensor

Raises:

ValueError – If the pooling_mode is not supported.
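For intuition, the sketch below shows how “avg” pooling over non-padded tokens is commonly computed; it mirrors the behavior described above rather than reproducing the module’s exact implementation:

import torch

def masked_average_pool(
    last_hidden_states: torch.Tensor,  # (batch_size, sequence_length, hidden_size)
    attention_mask: torch.Tensor,      # (batch_size, sequence_length); 1 = token, 0 = pad
) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts                    # (batch_size, hidden_size)

# Equivalent call through the documented module:
# pooled = Pooling(pooling_mode="avg")(last_hidden_states, attention_mask)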

nemo_export.model_adapters.embedding.embedding_adapter.get_llama_bidirectional_hf_model(
model_name_or_path: Union[str, os.PathLike[str]],
normalize: bool,
pooling_mode: Optional[Literal['avg', 'cls', 'last']] = None,
torch_dtype: Optional[Union[torch.dtype, str]] = None,
trust_remote_code: bool = False,
)#

Factory function to create a LlamaBidirectionalHFAdapter with proper configuration.

This function loads a HuggingFace transformer model and tokenizer, configures the appropriate pooling strategy based on the tokenizer’s padding side, and wraps everything in a LlamaBidirectionalHFAdapter.

Special handling is provided for NVEmbedModel, which has separate embedding and latent attention components.

Parameters:
  • model_name_or_path – Path to the model directory or HuggingFace model identifier.

  • normalize – Whether to apply L2 normalization to the output embeddings.

  • pooling_mode

    The pooling strategy to use. If None, defaults to “avg”. The mode is automatically adjusted based on the tokenizer’s padding side:

    • “last” becomes “last__right” for right-padding tokenizers

    • “cls” becomes “cls__left” for left-padding tokenizers

  • torch_dtype – The torch data type to use for the model. If None, uses model default.

  • trust_remote_code – Whether to trust remote code when loading the model.

Returns:

A tuple containing:

  • LlamaBidirectionalHFAdapter: the configured adapter model.

  • AutoTokenizer: the tokenizer for the model.

Return type:

tuple

Example

>>> model, tokenizer = get_llama_bidirectional_hf_model(
...     "sentence-transformers/all-MiniLM-L6-v2",
...     normalize=True,
...     pooling_mode="avg",
... )
>>> # Use model and tokenizer for embedding generation
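Expanding on the final comment, a hedged sketch of how the returned model and tokenizer could be used together (the sentences are illustrative):

import torch

sentences = ["NeMo export adapters", "Sentence embeddings with average pooling"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

# With normalize=True the rows are unit length, so the dot product
# gives cosine similarity directly.
similarity = embeddings @ embeddings.T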