classifiers.content_type#

Module Contents#

Classes#

ContentTypeClassifier

ContentTypeClassifier is a text classification model designed to categorize documents into one of 11 distinct speech types based on their content.

ContentTypeModel

ContentTypeModelConfig

Data#

API#

classifiers.content_type.CONTENT_TYPE_IDENTIFIER#

‘nvidia/content-type-classifier-deberta’

class classifiers.content_type.ContentTypeClassifier(
filter_by: list[str] | None = None,
batch_size: int = 256,
text_field: str = 'text',
pred_column: str = 'content_pred',
prob_column: str | None = None,
max_chars: int = 5000,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)#

Bases: nemo_curator.classifiers.base.DistributedDataClassifier

ContentTypeClassifier is a text classification model designed to categorize documents into one of 11 distinct speech types based on their content. It analyzes and understands the nuances of textual information, enabling accurate classification across a diverse range of content types. The pretrained model used by this class is called NemoCurator Content Type Classifier DeBERTa. It can be found on Hugging Face here: https://huggingface.co/nvidia/content-type-classifier-deberta. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Attributes:

filter_by (list[str], optional): The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
batch_size (int): The number of samples per batch for inference. Defaults to 256.
text_field (str): The field in the dataset that should be classified. Defaults to “text”.
pred_column (str): The column name where predictions will be stored. Defaults to “content_pred”.
prob_column (str, optional): The column name where prediction probabilities will be stored. Defaults to None.
max_chars (int): The maximum number of characters in each document to consider for classification. Defaults to 5000.
device_type (str): The type of device to use for inference, either “cuda” or “cpu”. Defaults to “cuda”.
autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Initialization

Constructs a Module

Args:

input_backend (Literal[“pandas”, “cudf”, “any”]): The backend the input dataframe must be on for the module to work.
name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.

A minimal usage sketch follows. The file paths and the filter_by label names ("News", "Blogs") are illustrative assumptions; the full set of 11 content-type labels is listed on the model card linked above, and the DocumentDataset read/write calls are the standard NeMo Curator I/O helpers.
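```python
from nemo_curator.classifiers import ContentTypeClassifier
from nemo_curator.datasets import DocumentDataset

# Read a JSONL dataset onto the GPU; the classifier runs on cuDF-backed data.
dataset = DocumentDataset.read_json("input_data/*.jsonl", backend="cudf")

# Keep only documents predicted as one of the listed content types.
# The label names here are illustrative; see the model card for the actual classes.
classifier = ContentTypeClassifier(
    filter_by=["News", "Blogs"],
    text_field="text",
    pred_column="content_pred",
    batch_size=256,
)

# Classification is applied lazily and executed when the result is written out.
result = classifier(dataset=dataset)
result.to_json("filtered_data/")
```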

class classifiers.content_type.ContentTypeModel(
config: classifiers.content_type.ContentTypeModelConfig,
autocast: bool = False,
max_mem_gb: int | None = None,
)#

Bases: crossfit.backend.torch.hf.model.HFModel

Initialization

load_config() → transformers.AutoConfig#

load_model(
device: str = 'cuda',
) → nemo_curator.classifiers.base.HFDeberta#

load_tokenizer() → transformers.AutoTokenizer#
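ContentTypeModel wraps the Hugging Face checkpoint for CrossFit-based distributed inference; ContentTypeClassifier constructs and drives it internally, so direct use is rarely needed. The sketch below, which uses only the constructor and methods listed here, shows how the pieces fit together and is an assumption-based illustration rather than a recommended workflow.

```python
from nemo_curator.classifiers.content_type import (
    ContentTypeModel,
    ContentTypeModelConfig,
)

# Hypothetical direct use of the model wrapper.
config = ContentTypeModelConfig()
wrapper = ContentTypeModel(config=config, autocast=True)

tokenizer = wrapper.load_tokenizer()       # transformers.AutoTokenizer for the DeBERTa checkpoint
model = wrapper.load_model(device="cuda")  # HFDeberta module placed on the GPU
```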
class classifiers.content_type.ContentTypeModelConfig#
fc_dropout: float#

0.2

max_len: int#

1024

model: str#

‘microsoft/deberta-v3-base’
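ContentTypeModelConfig is a plain configuration object holding the tokenizer length, dropout, and base checkpoint name. Assuming standard dataclass semantics, individual fields can be overridden at construction time, for example to shorten the tokenized sequence length:

```python
from nemo_curator.classifiers.content_type import ContentTypeModelConfig

# Assumes dataclass-style construction; field names match the attributes above.
config = ContentTypeModelConfig(max_len=512)
print(config.model)       # 'microsoft/deberta-v3-base'
print(config.fc_dropout)  # 0.2
```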