`stages.text.embedders.base`#

Module Contents#

Classes#

`EmbeddingCreatorStage`	Base class for high-level composite stages.
`EmbeddingModelStage`	HuggingFace model stage that produces embeddings with pooling.

API#

class stages.text.embedders.base.EmbeddingCreatorStage#

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Base class for high-level composite stages.

Composite stages are user-facing stages that decompose into multiple low-level execution stages during pipeline planning. They provide a simplified API while maintaining fine-grained control at execution time.

Composite stages never actually execute - they only exist to be decomposed into their constituent execution stages.

Initialization

autocast: bool#: True

decompose() → list[nemo_curator.stages.base.ProcessingStage]#

Decompose into execution stages.

This method must be implemented by composite stages to define what low-level stages they represent.

Returns (list[ProcessingStage]): List of execution stages that will actually run

embedding_field: str#: ‘embeddings’

embedding_pooling: Literal[mean_pooling, last_token]#: ‘mean_pooling’

hf_token: str | None#: None

max_chars: int | None#: None

max_seq_length: int | None#: None

model_identifier: str#: ‘sentence-transformers/all-MiniLM-L6-v2’

model_inference_batch_size: int#: 1024

padding_side: Literal[left, right]#: ‘right’

sort_by_length: bool#: True

text_field: str#: ‘text’

class stages.text.embedders.base.EmbeddingModelStage( model_identifier: str, embedding_field: str = 'embeddings', pooling: Literal[mean_pooling, last_token] = 'mean_pooling', hf_token: str | None = None, model_inference_batch_size: int = 1024, has_seq_order: bool = True, padding_side: Literal[left, right] = 'right', autocast: bool = True, )#

Bases: nemo_curator.stages.text.models.model.ModelStage

HuggingFace model stage that produces embeddings with pooling.

Initialization

collect_outputs( processed_outputs: list[torch.Tensor], ) → list[list[float]]#

create_output_dataframe( df_cpu: pandas.DataFrame, collected_output: list[list[float]], ) → pandas.DataFrame#: Create output dataframe with embeddings.

outputs() → tuple[list[str], list[str]]#

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns) where: - output_top_level_attributes: List of task attributes this stage adds/modifies - output_data_attributes: List of attributes within the data that this stage adds/modifies

process_model_output( outputs: torch.Tensor, model_input_batch: dict[str, torch.Tensor] | None = None, ) → torch.Tensor#: Process model outputs to create embeddings.

setup( _: nemo_curator.backends.base.WorkerMetadata | None = None, ) → None#: Load the model for inference.

stages.text.embedders.base#

Module Contents#

Classes#

API#

`stages.text.embedders.base`#