nemo_export.model_adapters.embedding.embedding_adapter#

Module Contents#

Classes#

LlamaBidirectionalHFAdapter

Wraps a text embedding model with pooling and normalization for bidirectional encoding.

Pooling

Pooling layer that aggregates token-level embeddings into sequence-level embeddings.

Functions#

get_llama_bidirectional_hf_model

Factory function to create a LlamaBidirectionalHFAdapter with proper configuration.

API#

class nemo_export.model_adapters.embedding.embedding_adapter.LlamaBidirectionalHFAdapter(
model: torch.nn.Module,
normalize: bool,
pooling_module: torch.nn.Module,
)#

Bases: torch.nn.Module

Wraps a text embedding model with pooling and normalization for bidirectional encoding.

This adapter combines a transformer model with configurable pooling strategies and optional L2 normalization to produce fixed-size embeddings from variable-length text sequences. It supports dimension reduction and various pooling methods including average, CLS token, and last token pooling.

Parameters:
  • model – The underlying transformer model (e.g., AutoModel from HuggingFace).

  • normalize – Whether to apply L2 normalization to the output embeddings.

  • pooling_module – The pooling module to use for aggregating token embeddings.

Initialization

Initialize the LlamaBidirectionalHFAdapter.

Parameters:
  • model – The transformer model to wrap.

  • normalize – If True, applies L2 normalization to output embeddings.

  • pooling_module – Module that handles pooling of token embeddings.
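For illustration, a minimal construction sketch is shown below; the HuggingFace model name is illustrative, and in practice the get_llama_bidirectional_hf_model factory documented later performs this wiring:

from transformers import AutoModel

from nemo_export.model_adapters.embedding.embedding_adapter import (
    LlamaBidirectionalHFAdapter,
    Pooling,
)

# Any HuggingFace encoder that returns last_hidden_state can be wrapped.
base_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

adapter = LlamaBidirectionalHFAdapter(
    model=base_model,
    normalize=True,                               # L2-normalize pooled embeddings
    pooling_module=Pooling(pooling_mode="avg"),   # average over non-padded tokens
)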

property device: torch.device#

Returns the device of the underlying model.

Returns:

The device where the model parameters are located.

Return type:

torch.device

forward(
input_ids: torch.Tensor,
attention_mask: torch.Tensor,
token_type_ids: Optional[torch.Tensor] = None,
dimensions: Optional[torch.Tensor] = None,
) → torch.Tensor#

Forward pass through the adapted model to generate embeddings.

Parameters:
  • input_ids – Token IDs of shape (batch_size, sequence_length).

  • attention_mask – Attention mask of shape (batch_size, sequence_length).

  • token_type_ids – Optional token type IDs for models that use them.

  • dimensions – Optional tensor specifying the desired output dimensions for each sample in the batch. If provided, embeddings will be truncated/masked to these dimensions.

Returns:

Pooled and optionally normalized embeddings of shape (batch_size, embedding_dim), or (batch_size, max_dimensions) if the dimensions parameter is used.

Return type:

torch.Tensor

Raises:

ValueError – If dimensions contains non-positive values.
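A usage sketch of the forward pass, assuming the adapter from the construction sketch above and a tokenizer that matches the wrapped model (the input sentences and the commented dimensions value are illustrative):

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
batch = tokenizer(
    ["What is NeMo Export?", "Bidirectional text embeddings"],
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    embeddings = adapter(
        input_ids=batch["input_ids"].to(adapter.device),
        attention_mask=batch["attention_mask"].to(adapter.device),
        # dimensions=torch.tensor([128, 128]),  # optional per-sample truncation
    )

print(embeddings.shape)  # (2, embedding_dim); rows are unit length if normalize=True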

class nemo_export.model_adapters.embedding.embedding_adapter.Pooling(pooling_mode: str)#

Bases: torch.nn.Module

Pooling layer that aggregates token-level embeddings into sequence-level embeddings.

Supports multiple pooling strategies:

  • ‘avg’: Average pooling over non-padded tokens

  • ‘cls’: Uses the first token (CLS token) with right padding

  • ‘cls__left’: Uses the first non-padded token with left padding

  • ‘last’: Uses the last token with left padding

  • ‘last__right’: Uses the last non-padded token with right padding

Parameters:

pooling_mode – The pooling strategy to use.

Initialization

Initialize the Pooling layer.

Parameters:

pooling_mode – The pooling strategy. Must be one of: ‘avg’, ‘cls’, ‘cls__left’, ‘last’, ‘last__right’.

forward(
last_hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
) → torch.Tensor#

Apply pooling to the hidden states.

Parameters:
  • last_hidden_states – Hidden states from the transformer model of shape (batch_size, sequence_length, hidden_size).

  • attention_mask – Attention mask of shape (batch_size, sequence_length) where 1 indicates real tokens and 0 indicates padding.

Returns:

Pooled embeddings of shape (batch_size, hidden_size).

Return type:

torch.Tensor

Raises:

ValueError – If the pooling_mode is not supported.
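For intuition, the sketch below shows how “avg” pooling over non-padded tokens is commonly computed; it mirrors the behavior described above rather than reproducing the module’s exact implementation:

import torch

def masked_average_pool(
    last_hidden_states: torch.Tensor,  # (batch_size, sequence_length, hidden_size)
    attention_mask: torch.Tensor,      # (batch_size, sequence_length); 1 = token, 0 = pad
) -> torch.Tensor:
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # guard against all-padding rows
    return summed / counts                    # (batch_size, hidden_size)

# Equivalent call through the documented module:
# pooled = Pooling(pooling_mode="avg")(last_hidden_states, attention_mask)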

nemo_export.model_adapters.embedding.embedding_adapter.get_llama_bidirectional_hf_model(
model_name_or_path: Union[str, os.PathLike[str]],
normalize: bool,
pooling_mode: Optional[Literal['avg', 'cls', 'last']] = None,
torch_dtype: Optional[Union[torch.dtype, str]] = None,
trust_remote_code: bool = False,
)#

Factory function to create a LlamaBidirectionalHFAdapter with proper configuration.

This function loads a HuggingFace transformer model and tokenizer, configures the appropriate pooling strategy based on the tokenizer’s padding side, and wraps everything in a LlamaBidirectionalHFAdapter.

Special handling is provided for NVEmbedModel, which has separate embedding and latent attention components.

Parameters:
  • model_name_or_path – Path to the model directory or HuggingFace model identifier.

  • normalize – Whether to apply L2 normalization to the output embeddings.

  • pooling_mode

    The pooling strategy to use. If None, defaults to “avg”. The mode is automatically adjusted based on the tokenizer’s padding side:

    • “last” becomes “last__right” for right-padding tokenizers

    • “cls” becomes “cls__left” for left-padding tokenizers

  • torch_dtype – The torch data type to use for the model. If None, uses model default.

  • trust_remote_code – Whether to trust remote code when loading the model.

Returns:

A tuple containing:

  • LlamaBidirectionalHFAdapter: the configured adapter model.

  • AutoTokenizer: the tokenizer for the model.

Return type:

tuple

Example

>>> model, tokenizer = get_llama_bidirectional_hf_model(
...     "sentence-transformers/all-MiniLM-L6-v2",
...     normalize=True,
...     pooling_mode="avg",
... )
>>> # Use model and tokenizer for embedding generation
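Expanding on the final comment, a hedged sketch of how the returned model and tokenizer could be used together (the sentences are illustrative):

import torch

sentences = ["NeMo export adapters", "Sentence embeddings with average pooling"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

# With normalize=True the rows are unit length, so the dot product
# gives cosine similarity directly.
similarity = embeddings @ embeddings.T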