classifiers.base

Module Contents

Classes

DistributedDataClassifier

Abstract class for running multi-node, multi-GPU data classification.

HFDeberta

API

class classifiers.base.DistributedDataClassifier(
model: str,
labels: list[str] | None,
filter_by: list[str] | None,
batch_size: int,
out_dim: int | None,
pred_column: str | list[str],
max_chars: int,
device_type: str,
autocast: bool,
)

Bases: nemo_curator.modules.base.BaseModule

Abstract class for running multi-node, multi-GPU data classification.

Initialization

Constructs a Module

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, Optional): The name of the module. If None, defaults to self.__class__.__name__.
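
The fallback for `name` can be illustrated with a minimal sketch. `SketchModule` and `LengthFilter` are hypothetical stand-ins written for this example, not the NeMo Curator classes, and the backend check is an assumption about how the three documented values might be enforced.

```python
from typing import Literal, Optional


class SketchModule:
    """Hypothetical stand-in for a base module; not the NeMo Curator class."""

    def __init__(
        self,
        input_backend: Literal["pandas", "cudf", "any"],
        name: Optional[str] = None,
    ) -> None:
        # Assumption: reject backends outside the three documented values.
        if input_backend not in ("pandas", "cudf", "any"):
            raise ValueError(f"Unsupported input_backend: {input_backend!r}")
        self.input_backend = input_backend
        # Fall back to the class name when no explicit name is given.
        self.name = name if name is not None else self.__class__.__name__


class LengthFilter(SketchModule):
    """Hypothetical subclass used only to show the default name."""


default_named = LengthFilter(input_backend="pandas")
custom_named = LengthFilter(input_backend="any", name="my_filter")
print(default_named.name)  # LengthFilter
print(custom_named.name)   # my_filter
```

Because the default is read from the instance's class, a subclass gets its own class name without overriding anything.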

call(
dataset: nemo_curator.datasets.DocumentDataset,
) -> nemo_curator.datasets.DocumentDataset

Performs an arbitrary operation on a dataset

Args: dataset (DocumentDataset): The dataset to operate on
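
The dataset-in, dataset-out contract of `call` can be sketched in plain Python. Everything here is hypothetical: `SketchDataset` stands in for `DocumentDataset` as a list of row dictionaries, and the use of `max_chars` to truncate text and `pred_column` to name the output column is an assumption based only on the parameter names in the signature above.

```python
from abc import ABC, abstractmethod


class SketchDataset:
    """Hypothetical stand-in for a document dataset: a list of row dicts."""

    def __init__(self, rows: list[dict]) -> None:
        self.rows = rows


class SketchModule(ABC):
    """Hypothetical abstract module: subclasses must implement call()."""

    @abstractmethod
    def call(self, dataset: SketchDataset) -> SketchDataset:
        """Perform an operation on a dataset and return the result."""


class LengthClassifier(SketchModule):
    """Toy classifier that labels each text as "short" or "long"."""

    def __init__(self, pred_column: str, max_chars: int) -> None:
        self.pred_column = pred_column
        self.max_chars = max_chars

    def call(self, dataset: SketchDataset) -> SketchDataset:
        rows = []
        for row in dataset.rows:
            # Assumption: only the first max_chars characters are classified.
            text = row["text"][: self.max_chars]
            label = "long" if len(text) > 10 else "short"
            # Copy the row so the input dataset is left unchanged.
            rows.append({**row, self.pred_column: label})
        return SketchDataset(rows)


docs = SketchDataset([{"text": "hi"}, {"text": "a much longer document"}])
classifier = LengthClassifier(pred_column="length_pred", max_chars=2000)
labeled = classifier.call(docs)
print(labeled.rows)
```

Returning a new dataset rather than mutating the input keeps the operation easy to chain with other modules in a pipeline.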

get_labels() -> list[str]

class classifiers.base.HFDeberta(config: dataclasses.dataclass)

Bases: torch.nn.Module, huggingface_hub.PyTorchModelHubMixin

Initialization

forward(batch: dict[str, torch.Tensor]) -> torch.Tensor

set_autocast(autocast: bool) -> None