nemo_curator.stages.text.classifiers.base

Module Contents

Classes

| Name | Description |
| --- | --- |
| `ClassifierModelStage` | Stage for Hugging Face model inference. |
| `Deberta` | Base PyTorch model where we add a classification head. |
| `DistributedDataClassifier` | Base composite stage for distributed data classification. |

API

class nemo_curator.stages.text.classifiers.base.ClassifierModelStage(
model_identifier: str,
cache_dir: str | None = None,
label_field: str = 'preds',
score_field: str | None = None,
model_inference_batch_size: int = 256,
has_seq_order: bool = True,
padding_side: typing.Literal['left', 'right'] = 'right',
max_seq_length: int | None = None,
autocast: bool = True,
keep_tokens: bool = False
)

Bases: ModelStage

Stage for Hugging Face model inference.

Parameters:

model_identifier
str

The identifier of the Hugging Face model.

cache_dir
str | None, defaults to None

The Hugging Face cache directory.

label_field
str, defaults to 'preds'

The name of the prediction column.

score_field
str | None, defaults to None

The name of the probability column.

model_inference_batch_size
int, defaults to 256

The size of the batch for model inference.

has_seq_order
bool, defaults to True

Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it improves the performance of the inference model.

padding_side
Literal['left', 'right'], defaults to 'right'

The side on which to pad the input tokens.

max_seq_length
int | None, defaults to None

If provided, clips the input tokens before the forward pass.

autocast
bool, defaults to True

Whether to use autocast. When True, we trade a small amount of accuracy for faster inference.

keep_tokens
bool, defaults to False

Whether to keep the input tokens in the output dataframe.
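The rationale for `has_seq_order` can be seen with a pure-Python sketch (the helper below is hypothetical and not part of the API): sorting sequences by token length before batching groups similar-length inputs together, so each batch needs far fewer pad tokens.

```python
# Toy token sequences of very different lengths (token id 1 everywhere).
seqs = [[1] * n for n in (2, 128, 3, 130, 4, 126, 5, 127)]

def padding_cost(seqs: list[list[int]], batch_size: int = 4) -> int:
    """Count the pad tokens needed when batching sequences in the given order."""
    total = 0
    for i in range(0, len(seqs), batch_size):
        batch = seqs[i : i + batch_size]
        width = max(len(s) for s in batch)  # each batch is padded to its longest row
        total += sum(width - len(s) for s in batch)
    return total

print(padding_cost(seqs))                      # 503 pad tokens in arrival order
print(padding_cost(sorted(seqs, key=len)))     # 15 pad tokens after length sorting
```

Fewer pad tokens means less wasted compute per batch, which is why sorting is on by default.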

nemo_curator.stages.text.classifiers.base.ClassifierModelStage._setup(
local_files_only: bool = True
) -> None
nemo_curator.stages.text.classifiers.base.ClassifierModelStage.create_output_dataframe(
df_cpu: pandas.DataFrame,
collected_output: dict[str, numpy.ndarray]
) -> pandas.DataFrame
nemo_curator.stages.text.classifiers.base.ClassifierModelStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.base.ClassifierModelStage.process_model_output(
outputs: torch.Tensor,
_: dict[str, torch.Tensor] | None = None
) -> dict[str, numpy.ndarray]
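For orientation, the label/score split that `process_model_output` produces can be approximated in plain Python (a sketch only; the real method operates on `torch.Tensor` logits and the function names here are illustrative): the argmax of each row of logits becomes the prediction, and, when a `score_field` is configured, the corresponding softmax probability becomes the score.

```python
import math

def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def split_logits(logits: list[list[float]]) -> dict[str, list]:
    """Map rows of logits to predicted class indices and their probabilities."""
    preds, scores = [], []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax
        preds.append(best)
        scores.append(probs[best])
    return {"preds": preds, "scores": scores}

out = split_logits([[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]])
print(out["preds"])  # [1, 0]
```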
class nemo_curator.stages.text.classifiers.base.Deberta(
config: dataclasses.dataclass
)

Bases: Module, PyTorchModelHubMixin

Base PyTorch model where we add a classification head.

Parameters:

config
dataclass

The configuration of the model.

device
device
dropout
= nn.Dropout(config['fc_dropout'])
fc
model
= AutoModel.from_pretrained(config['base_model'])
nemo_curator.stages.text.classifiers.base.Deberta.forward(
batch: dict[str, torch.Tensor]
) -> torch.Tensor
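One common pattern for a classification head on top of an encoder, and an assumption here rather than a statement about this class's exact forward pass, is attention-mask-aware mean pooling of the hidden states followed by a linear projection (`fc`) to per-class logits. A pure-Python sketch of that pattern:

```python
def masked_mean_pool(hidden: list[list[float]], mask: list[int]) -> list[float]:
    """Average hidden-state vectors over positions where the attention mask is 1."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            count += 1
            for j in range(dim):
                total[j] += vec[j]
    return [t / count for t in total]

def linear_head(pooled: list[float], weights: list[list[float]], bias: list[float]) -> list[float]:
    """Project the pooled vector to per-class logits (dropout is a no-op at inference)."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]

hidden = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]  # 3 positions, hidden size 2
mask = [1, 1, 0]                               # last position is padding, so it is ignored
pooled = masked_mean_pool(hidden, mask)        # [2.0, 4.0]
logits = linear_head(pooled, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(logits)  # [2.0, 4.0]
```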
class nemo_curator.stages.text.classifiers.base.DistributedDataClassifier(
model_identifier: str,
cache_dir: str | None = None,
label_field: str = 'preds',
score_field: str | None = None,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
max_seq_length: int | None = None,
padding_side: typing.Literal['left', 'right'] = 'right',
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)
Dataclass

Bases: CompositeStage[DocumentBatch, DocumentBatch]

Base composite stage for distributed data classification.

It decomposes into a tokenizer stage and a model stage.

Parameters:

model_identifier
str

The identifier of the Hugging Face model.

cache_dir
str | None, defaults to None

The Hugging Face cache directory.

label_field
str, defaults to 'preds'

The name of the prediction column.

score_field
str | None, defaults to None

The name of the probability column.

text_field
str, defaults to 'text'

The name of the text field in the input data.

filter_by
list[str] | None, defaults to None

For categorical classifiers, the list of labels to filter the data by.

max_chars
int | None, defaults to None

Limits the total number of characters that can be fed to the tokenizer. If None, the text is not truncated.

max_seq_length
int | None, defaults to None

Limits the length of the sequence returned by the tokenizer. If None, the tokenizer's model_max_length is used.

padding_side
Literal['left', 'right'], defaults to 'right'

The side on which to pad the input tokens.

sort_by_length
bool, defaults to True

Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it improves the performance of the inference model.

model_inference_batch_size
int, defaults to 256

The size of the batch for model inference.

autocast
bool, defaults to True

Whether to use autocast. When True, we trade a small amount of accuracy for faster inference.

keep_tokens
bool, defaults to False

Whether to keep the input tokens in the output dataframe.

use_existing_tokens
bool, defaults to False

Whether to use existing tokens from the input dataframe. If True, the relevant token fields are assumed to be ["input_ids", "attention_mask"] and tokenization is skipped.

autocast
bool = True
cache_dir
str | None = None
filter_by
list[str] | None = None
keep_tokens
bool = False
label_field
str = 'preds'
max_chars
int | None = None
max_seq_length
int | None = None
model_identifier
str
model_inference_batch_size
int = 256
padding_side
Literal['left', 'right'] = 'right'
score_field
str | None = None
sort_by_length
bool = True
text_field
str = 'text'
use_existing_tokens
bool = False
nemo_curator.stages.text.classifiers.base.DistributedDataClassifier.__post_init__() -> None
nemo_curator.stages.text.classifiers.base.DistributedDataClassifier.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.text.classifiers.base.DistributedDataClassifier.filter_by_category(
value: str
) -> bool
nemo_curator.stages.text.classifiers.base.DistributedDataClassifier.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.base.DistributedDataClassifier.outputs() -> tuple[list[str], list[str]]