stages.text.embedders.base
Module Contents
Classes
EmbeddingCreatorStage | Base class for high-level composite stages.
EmbeddingModelStage | HuggingFace model stage that produces embeddings with pooling.
API
- class stages.text.embedders.base.EmbeddingCreatorStage
Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]
Base class for high-level composite stages.
Composite stages are user-facing stages that decompose into multiple low-level execution stages during pipeline planning. They provide a simplified API while maintaining fine-grained control at execution time.
Composite stages never execute themselves; they exist only to be decomposed into their constituent execution stages.
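The decomposition described above can be illustrated with a minimal, library-free sketch. Every class and function below (`ProcessingStage`, `CompositeStage`, `ToyTokenizerStage`, `ToyEmbeddingStage`, `ToyEmbeddingCreator`, `plan`, `run`) is a hypothetical stand-in written for this example; none of it is the nemo_curator implementation, and the real planner may work differently.

```python
from dataclasses import dataclass
from typing import Optional


class ProcessingStage:
    """Hypothetical stand-in for a low-level execution stage."""

    def process(self, batch):
        raise NotImplementedError


class CompositeStage(ProcessingStage):
    """Hypothetical stand-in: exists only at planning time, never runs."""

    def decompose(self):
        raise NotImplementedError

    def process(self, batch):
        raise RuntimeError("composite stages are decomposed, not executed")


@dataclass
class ToyTokenizerStage(ProcessingStage):
    max_chars: Optional[int] = None

    def process(self, batch):
        # Truncate by characters, then split on whitespace (toy tokenizer).
        return [text[: self.max_chars].split() for text in batch]


class ToyEmbeddingStage(ProcessingStage):
    def process(self, batch):
        # Toy "embedding": [token count, mean token length].
        return [
            [float(len(toks)), sum(map(len, toks)) / max(len(toks), 1)]
            for toks in batch
        ]


@dataclass
class ToyEmbeddingCreator(CompositeStage):
    max_chars: Optional[int] = None

    def decompose(self):
        # The user-facing stage expands into the stages that actually run.
        return [ToyTokenizerStage(max_chars=self.max_chars), ToyEmbeddingStage()]


def plan(stages):
    """Recursively replace composite stages with their execution stages."""
    planned = []
    for stage in stages:
        if isinstance(stage, CompositeStage):
            planned.extend(plan(stage.decompose()))
        else:
            planned.append(stage)
    return planned


def run(stages, batch):
    """Plan first, then execute only the low-level stages in order."""
    for stage in plan(stages):
        batch = stage.process(batch)
    return batch
```

In this sketch, `plan([ToyEmbeddingCreator(max_chars=11)])` yields a tokenizer stage followed by an embedding stage, and the composite's own `process` is never called.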
Initialization
- autocast: bool = True
- decompose() → list[nemo_curator.stages.base.ProcessingStage]
Decompose into execution stages.
This method must be implemented by composite stages to define what low-level stages they represent.
Returns (list[ProcessingStage]): List of execution stages that will actually run
- embedding_field: str = 'embeddings'
- embedding_pooling: Literal['mean_pooling', 'last_token'] = 'mean_pooling'
- hf_token: str | None = None
- max_chars: int | None = None
- max_seq_length: int | None = None
- model_identifier: str = 'sentence-transformers/all-MiniLM-L6-v2'
- model_inference_batch_size: int = 1024
- padding_side: Literal['left', 'right'] = 'right'
- sort_by_length: bool = True
- text_field: str = 'text'
- class stages.text.embedders.base.EmbeddingModelStage(
    model_identifier: str,
    embedding_field: str = 'embeddings',
    pooling: Literal['mean_pooling', 'last_token'] = 'mean_pooling',
    hf_token: str | None = None,
    model_inference_batch_size: int = 1024,
    has_seq_order: bool = True,
    padding_side: Literal['left', 'right'] = 'right',
    autocast: bool = True,
  )
Bases: nemo_curator.stages.text.models.model.ModelStage
HuggingFace model stage that produces embeddings with pooling.
Initialization
- collect_outputs(
    processed_outputs: list[torch.Tensor],
  )
- create_output_dataframe(
    df_cpu: pandas.DataFrame,
    collected_output: list[list[float]],
  )
Create output dataframe with embeddings.
- outputs() → tuple[list[str], list[str]]
Define stage output specification.
Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns) where:
    - output_top_level_attributes: List of task attributes this stage adds/modifies
    - output_data_attributes: List of attributes within the data that this stage adds/modifies
- process_model_output(
    outputs: torch.Tensor,
    model_input_batch: dict[str, torch.Tensor] | None = None,
  )
Process model outputs to create embeddings.
- setup(
    _: nemo_curator.backends.base.WorkerMetadata | None = None,
  )
Load the model for inference.
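The two pooling options named in the signature, 'mean_pooling' and 'last_token', and their interaction with `padding_side` can be sketched in plain Python. This illustrates the general techniques under stated assumptions; it is not the code in `process_model_output`. The function names, the nested-list inputs (the real stage works on `torch.Tensor`), and the attention-mask convention of 1 for real tokens and 0 for padding are all assumptions made for this example.

```python
def mean_pooling(token_embeddings, attention_mask):
    """Average the embeddings of non-padding tokens in each sequence."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        dim = len(tokens[0])
        count = max(sum(mask), 1)  # guard against an all-padding row
        pooled.append([
            sum(tok[d] * m for tok, m in zip(tokens, mask)) / count
            for d in range(dim)
        ])
    return pooled


def last_token_pooling(token_embeddings, attention_mask, padding_side="right"):
    """Select the embedding of the last non-padding token in each sequence."""
    pooled = []
    for tokens, mask in zip(token_embeddings, attention_mask):
        if padding_side == "left":
            # With left padding, real tokens end at the final position.
            index = len(tokens) - 1
        else:
            # With right padding, skip the trailing padding positions.
            index = max(sum(mask) - 1, 0)
        pooled.append(list(tokens[index]))
    return pooled


# One sequence of two real tokens plus one right-side padding position.
tokens = [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]]
mask = [[1, 1, 0]]
print(mean_pooling(tokens, mask))        # [[2.0, 3.0]]
print(last_token_pooling(tokens, mask))  # [[3.0, 4.0]]
```

Mean pooling does not depend on where the padding sits, while last-token pooling must know the padding side to locate the final real token, which is one plausible reason the stage exposes `padding_side` alongside `pooling`.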