`stages.text.classifiers.base`

Module Contents

Classes

`ClassifierModelStage`	Stage for Hugging Face model inference.
`Deberta`	Base PyTorch model where we add a classification head.
`DistributedDataClassifier`	Base composite stage for distributed data classification.

API

class stages.text.classifiers.base.ClassifierModelStage(model_identifier: str, cache_dir: str | None = None, pred_column: str = 'preds', prob_column: str | None = None, model_inference_batch_size: int = 256, has_seq_order: bool = True, padding_side: Literal['left', 'right'] = 'right', autocast: bool = True)

Bases: nemo_curator.stages.text.models.model.ModelStage

Stage for Hugging Face model inference.

Args:
- model_identifier: The identifier of the Hugging Face model.
- cache_dir: The Hugging Face cache directory. Defaults to None.
- pred_column: The name of the prediction column. Defaults to 'preds'.
- prob_column: The name of the probability column. Defaults to None.
- model_inference_batch_size: The size of the batch for model inference. Defaults to 256.
- has_seq_order: Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it improves inference performance. Defaults to True.
- padding_side: The side on which to pad the input tokens. Defaults to 'right'.
- autocast: Whether to use autocast. When True, a small amount of accuracy is traded for faster inference. Defaults to True.
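The sorting behavior described for `has_seq_order` can be illustrated in plain Python. This is a minimal sketch, not the library's implementation: `batch_padding` and the toy token counts are hypothetical, and restoring the original row order afterwards is an assumption about how sorted results would be handed back.

```python
def batch_padding(lengths, batch_size):
    """Count pad tokens needed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Toy token counts for six documents (hypothetical data).
lengths = [3, 50, 4, 48, 5, 52]

# Sort by length, remembering original positions so results can be restored.
order = sorted(range(len(lengths)), key=lambda i: lengths[i])
sorted_lengths = [lengths[i] for i in order]

unsorted_pad = batch_padding(lengths, batch_size=2)
sorted_pad = batch_padding(sorted_lengths, batch_size=2)

# Pretend the model returns one prediction per sorted row, then undo the sort.
sorted_preds = [f"pred_for_len_{n}" for n in sorted_lengths]
restored = [None] * len(lengths)
for sorted_pos, original_pos in enumerate(order):
    restored[original_pos] = sorted_preds[sorted_pos]
```

With these toy numbers, sorting groups sequences of similar length into the same batch, so far fewer pad tokens are processed per batch.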

Initialization

create_output_dataframe(df_cpu: pandas.DataFrame, collected_output: dict[str, numpy.ndarray]) → pandas.DataFrame

outputs() → tuple[list[str], list[str]]

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_top_level_attributes, output_data_attributes), where:
- output_top_level_attributes: List of task attributes this stage adds or modifies.
- output_data_attributes: List of attributes within the data that this stage adds or modifies.
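To make the shape of this return value concrete, here is a minimal sketch using a hypothetical `ToyStage`. The `"data"` top-level attribute and the rule that the probability column is reported only when set are assumptions for illustration, not confirmed behavior of `ClassifierModelStage`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyStage:
    """Hypothetical stand-in; not the library's ClassifierModelStage."""

    pred_column: str = "preds"
    prob_column: Optional[str] = None

    def outputs(self) -> tuple[list[str], list[str]]:
        # Assumption: the stage touches the task's "data" attribute and adds
        # the prediction column, plus the probability column when one is set.
        columns = [self.pred_column]
        if self.prob_column is not None:
            columns.append(self.prob_column)
        return ["data"], columns


default_spec = ToyStage().outputs()
with_probs_spec = ToyStage(prob_column="probs").outputs()
```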

process_model_output(outputs: torch.Tensor, _: dict[str, torch.Tensor] | None = None) → dict[str, numpy.ndarray]
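The reference does not say how model outputs are turned into the prediction and probability columns. One common approach, sketched here in plain Python under that assumption, is a softmax over the logits followed by an argmax. `softmax`, `to_output_columns`, and the label set are hypothetical names, and lists stand in for tensors and arrays.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_output_columns(batch_logits, labels, pred_column="preds", prob_column=None):
    """Map raw logits to a prediction column and, optionally, a probability column."""
    probs = [softmax(row) for row in batch_logits]
    # The argmax of each row selects the most likely label.
    preds = [labels[max(range(len(p)), key=p.__getitem__)] for p in probs]
    columns = {pred_column: preds}
    if prob_column is not None:
        columns[prob_column] = probs
    return columns


labels = ["negative", "positive"]  # hypothetical label set
result = to_output_columns([[2.0, 0.5], [-1.0, 3.0]], labels, prob_column="probs")
```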

class stages.text.classifiers.base.Deberta(config: dataclasses.dataclass)

Bases: torch.nn.Module, huggingface_hub.PyTorchModelHubMixin

Base PyTorch model where we add a classification head.

Args: config: The configuration of the model.

Initialization

property device: torch.device

forward(batch: dict[str, torch.Tensor]) → torch.Tensor

class stages.text.classifiers.base.DistributedDataClassifier

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks.DocumentBatch, nemo_curator.tasks.DocumentBatch]

Base composite stage for distributed data classification.

It decomposes into a tokenizer stage and a model stage.

Args:
- model_identifier: The identifier of the Hugging Face model.
- cache_dir: The Hugging Face cache directory. Defaults to None.
- pred_column: The name of the prediction column. Defaults to 'preds'.
- prob_column: The name of the probability column. Defaults to None.
- text_field: The name of the text field in the input data. Defaults to 'text'.
- filter_by: For categorical classifiers, the list of labels to filter the data by. Defaults to None.
- max_chars: Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
- max_seq_length: Limits the total sequence returned by the tokenizer so that it has a maximum length. If None, the tokenizer's model_max_length is used. Defaults to 512.
- padding_side: The side on which to pad the input tokens. Defaults to 'right'.
- sort_by_length: Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it improves inference performance. Defaults to True.
- model_inference_batch_size: The size of the batch for model inference. Defaults to 256.
- autocast: Whether to use autocast. When True, a small amount of accuracy is traded for faster inference. Defaults to True.
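The interaction of `max_chars`, `max_seq_length`, and `padding_side` can be sketched with a toy tokenizer. Everything here is hypothetical: `prepare` is not a library function, word lengths stand in for real token ids, and a real Hugging Face tokenizer handles truncation and padding internally.

```python
def prepare(text, max_chars=None, max_seq_length=8, padding_side="right", pad_id=0):
    """Toy tokenizer: truncate characters, split on spaces, then truncate and pad ids."""
    if max_chars is not None:
        text = text[:max_chars]
    # Hypothetical token ids: word length stands in for a vocabulary lookup.
    ids = [len(word) for word in text.split()][:max_seq_length]
    padding = [pad_id] * (max_seq_length - len(ids))
    return ids + padding if padding_side == "right" else padding + ids


right = prepare("a bb ccc", max_seq_length=5)
left = prepare("a bb ccc", max_seq_length=5, padding_side="left")
clipped = prepare("a bb ccc dddd", max_chars=4, max_seq_length=5)
```

In this sketch the character limit is applied before tokenization and the sequence limit after it, which mirrors the order in which the two arguments are described above.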

Initialization

autocast: bool = True

cache_dir: str | None = None

decompose() → list[nemo_curator.stages.base.ProcessingStage]

Decompose into execution stages.

This method must be implemented by composite stages to define what low-level stages they represent.

Returns (list[ProcessingStage]): List of execution stages that will actually run
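The decomposition pattern can be sketched without the library. The classes below are hypothetical stand-ins, and the `run` helper is a toy executor. The sketch only shows that a composite stage does no processing itself and instead returns the ordered low-level stages, here a tokenizer stage followed by a model stage.

```python
class ToyTokenizerStage:
    """Hypothetical tokenizer stage: adds a list of whitespace tokens to each row."""

    def process(self, batch):
        return [{**row, "tokens": row["text"].split()} for row in batch]


class ToyModelStage:
    """Hypothetical model stage: labels rows by token count."""

    def __init__(self, pred_column="preds"):
        self.pred_column = pred_column

    def process(self, batch):
        return [
            {"text": row["text"], self.pred_column: "long" if len(row["tokens"]) > 2 else "short"}
            for row in batch
        ]


class ToyCompositeClassifier:
    """Hypothetical composite: does no work itself, only lists its sub-stages."""

    def __init__(self, pred_column="preds"):
        self.pred_column = pred_column

    def decompose(self):
        return [ToyTokenizerStage(), ToyModelStage(pred_column=self.pred_column)]


def run(stage, batch):
    # A toy executor: run each low-level stage in order.
    for sub_stage in stage.decompose():
        batch = sub_stage.process(batch)
    return batch


out = run(ToyCompositeClassifier(), [{"text": "a b c d"}, {"text": "hi"}])
```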

filter_by: list[str] | None = None

filter_by_category(value: str) → bool
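The reference gives no description for `filter_by_category`. A plausible reading, sketched here as an assumption, is a membership test of a predicted label against `filter_by`. `ToyClassifier` and the sample rows are hypothetical.

```python
class ToyClassifier:
    """Hypothetical stand-in showing one way a label filter could behave."""

    def __init__(self, filter_by=None):
        self.filter_by = filter_by

    def filter_by_category(self, value: str) -> bool:
        # Assumption: keep a row when its predicted label is listed in filter_by.
        return value in self.filter_by


clf = ToyClassifier(filter_by=["high", "medium"])
rows = [{"text": "a", "preds": "high"}, {"text": "b", "preds": "low"}]
kept = [row for row in rows if clf.filter_by_category(row["preds"])]
```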

inputs() → tuple[list[str], list[str]]

Get the inputs for this stage.

max_chars: int | None = None

max_seq_length: int | None = None

model_identifier: str = None

model_inference_batch_size: int = 256

outputs() → tuple[list[str], list[str]]

Get the outputs for this stage.

padding_side: Literal['left', 'right'] = 'right'

pred_column: str = 'preds'

prob_column: str | None = None

sort_by_length: bool = True

text_field: str = 'text'
