nemo_curator.stages.text.classifiers.base
Module Contents
Classes
API
Bases: ModelStage
Stage for Hugging Face model inference.
Parameters:
The identifier of the Hugging Face model.
The name of the prediction column.
The name of the probability column. Defaults to None.
The size of the batch for model inference. Defaults to 256.
Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it improves inference performance. Defaults to True.
The side on which to pad the input tokens. Defaults to "right".
If provided, clips the input tokens before the forward pass. Defaults to None.
Whether to use autocast (mixed precision). When True, a small amount of accuracy is traded for faster inference. Defaults to True.
Whether to keep the input tokens in the output dataframe. Defaults to False.
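The benefit of sorting by token length can be shown with a small sketch. This is not NeMo Curator's implementation; the function names (`make_batches`, `padded_tokens`) are hypothetical, and it only illustrates why grouping similarly sized sequences into the same batch reduces wasted padding.

```python
# Illustrative sketch (hypothetical helpers, not NeMo Curator's API):
# sorting inputs by token length before batching reduces padding waste,
# because each batch is padded to the length of its longest sequence.

def make_batches(token_seqs, batch_size, sort_by_length=True):
    """Group token sequences into batches, optionally sorted by length."""
    order = range(len(token_seqs))
    if sort_by_length:
        order = sorted(order, key=lambda i: len(token_seqs[i]))
    ordered = [token_seqs[i] for i in order]
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padded_tokens(batches):
    """Total token slots after padding each batch to its longest sequence."""
    return sum(max(len(s) for s in b) * len(b) for b in batches)

seqs = [[0] * n for n in (3, 128, 5, 120, 4, 126)]
unsorted_batches = make_batches(seqs, batch_size=2, sort_by_length=False)
sorted_batches = make_batches(seqs, batch_size=2, sort_by_length=True)
# Sorting groups short sequences together, so far fewer pad slots are needed.
```

Here the unsorted batches pair a short and a long sequence, so every batch pads to roughly the maximum length, while the sorted batches keep short sequences together.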
Bases: Module, PyTorchModelHubMixin
Base PyTorch model where we add a classification head.
Parameters:
The configuration of the model.
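Conceptually, a classification head maps a pooled hidden state to one logit per label, and a softmax turns the logits into probabilities. The actual class is a PyTorch `nn.Module`; the pure-Python sketch below (all names hypothetical) only shows the underlying arithmetic.

```python
import math

# Conceptual sketch of a classification head: a linear projection from a
# pooled hidden state to per-label logits, followed by softmax. The real
# model is a PyTorch nn.Module; this only illustrates the math.

def classification_head(hidden, weights, bias):
    """Linear layer: logits[j] = sum_i hidden[i] * weights[j][i] + bias[j]."""
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

hidden = [0.5, -1.0, 2.0]    # pooled hidden state (e.g. the CLS token)
weights = [[1.0, 0.0, 0.5],  # one row of weights per output label
           [-0.5, 1.0, 0.0]]
bias = [0.0, 0.1]
probs = softmax(classification_head(hidden, weights, bias))
pred = max(range(len(probs)), key=probs.__getitem__)  # argmax -> prediction
```

The argmax over `probs` is what would populate the prediction column, and `probs` itself the probability column.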
Bases: CompositeStage[DocumentBatch, DocumentBatch]
Base composite stage for distributed data classification.
It decomposes into a tokenizer stage and a model stage.
Parameters:
The identifier of the Hugging Face model.
The Hugging Face cache directory. Defaults to None.
The name of the prediction column. Defaults to "preds".
The name of the probability column. Defaults to None.
The name of the text field in the input data. Defaults to "text".
For categorical classifiers, the list of labels to filter the data by. Defaults to None.
Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
Limits the length of the token sequence returned by the tokenizer. If None, the tokenizer's model_max_length is used. Defaults to 512.
The side on which to pad the input tokens. Defaults to "right".
Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it improves inference performance. Defaults to True.
The size of the batch for model inference. Defaults to 256.
Whether to use autocast (mixed precision). When True, a small amount of accuracy is traded for faster inference. Defaults to True.
Whether to keep the input tokens in the output dataframe. Defaults to False.
Whether to use existing tokens from the input dataframe. If True, the fields "input_ids" and "attention_mask" are assumed to already be present and tokenization is skipped. Defaults to False.
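The decomposition described above, a tokenizer stage followed by a model inference stage, can be sketched as below. All class and method names here are illustrative stand-ins, not NeMo Curator's actual API: the "tokenizer" is a toy whitespace splitter and the "model" a toy rule, purely to show how a composite stage runs its sub-stages in sequence over a document batch.

```python
# Hypothetical sketch of a composite classification stage decomposing into
# a tokenizer stage and a model stage. Names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class ToyTokenizerStage:
    text_field: str = "text"
    max_seq_length: int = 512

    def process(self, batch):
        # Toy "tokenizer": one integer id per whitespace token, truncated
        # to max_seq_length, with a matching attention mask.
        for doc in batch:
            ids = [hash(w) % 30000 for w in doc[self.text_field].split()]
            doc["input_ids"] = ids[: self.max_seq_length]
            doc["attention_mask"] = [1] * len(doc["input_ids"])
        return batch

@dataclass
class ToyModelStage:
    pred_column: str = "preds"

    def process(self, batch):
        # Stand-in for model inference: label documents by token count.
        for doc in batch:
            doc[self.pred_column] = "long" if len(doc["input_ids"]) > 4 else "short"
        return batch

@dataclass
class ToyCompositeClassifier:
    stages: list = field(default_factory=lambda: [ToyTokenizerStage(), ToyModelStage()])

    def decompose(self):
        # A composite stage does no work itself; it yields its sub-stages.
        return self.stages

    def run(self, batch):
        for stage in self.decompose():
            batch = stage.process(batch)
        return batch

docs = [{"text": "short doc"}, {"text": "a much longer document with many words"}]
out = ToyCompositeClassifier().run(docs)
```

Skipping the tokenizer stage when token fields already exist corresponds to running only the second sub-stage on a batch that already carries "input_ids" and "attention_mask".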