nemo_curator.stages.text.classifiers.content_type


Module Contents

Classes

Name | Description
ContentTypeClassifier | ContentTypeClassifier is a text classification model designed to categorize documents into one of 11 distinct speech types based on their content.

Data

CONTENT_TYPE_MODEL_IDENTIFIER

MAX_SEQ_LENGTH

API

class nemo_curator.stages.text.classifiers.content_type.ContentTypeClassifier(
    cache_dir: str | None = None,
    label_field: str = 'content_pred',
    score_field: str | None = None,
    text_field: str = 'text',
    filter_by: list[str] | None = None,
    max_chars: int = 6000,
    sort_by_length: bool = True,
    model_inference_batch_size: int = 256,
    autocast: bool = True,
    keep_tokens: bool = False,
    use_existing_tokens: bool = False
)

Bases: DistributedDataClassifier

ContentTypeClassifier is a text classification model that categorizes documents into one of 11 distinct speech types based on their content. It is intended to classify accurately across a diverse range of content types. The pretrained model used by this class is NemoCurator Content Type Classifier DeBERTa, available on Hugging Face at https://huggingface.co/nvidia/content-type-classifier-deberta. The classifier is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.
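The constructor arguments can be collected and checked before the classifier is created. The following is a minimal sketch, not the library's own tooling: `DOCUMENTED_DEFAULTS`, `classifier_kwargs`, and `build_classifier` are hypothetical helpers, and `content_score` is a made-up output column name. The defaults are copied from the signature above, and the import is deferred so the argument handling can be tried without `nemo_curator` installed.

```python
# Defaults copied from the documented ContentTypeClassifier signature.
DOCUMENTED_DEFAULTS = {
    "cache_dir": None,
    "label_field": "content_pred",
    "score_field": None,
    "text_field": "text",
    "filter_by": None,
    "max_chars": 6000,
    "sort_by_length": True,
    "model_inference_batch_size": 256,
    "autocast": True,
    "keep_tokens": False,
    "use_existing_tokens": False,
}


def classifier_kwargs(**overrides):
    """Merge overrides into the documented defaults, rejecting unknown names."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS)
    if unknown:
        raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


def build_classifier(**overrides):
    """Instantiate the classifier (hypothetical helper; requires nemo_curator)."""
    # Imported lazily so classifier_kwargs works without the package installed.
    from nemo_curator.stages.text.classifiers.content_type import (
        ContentTypeClassifier,
    )

    return ContentTypeClassifier(**classifier_kwargs(**overrides))


# "content_score" is an illustrative column name, not a library default.
kwargs = classifier_kwargs(score_field="content_score", model_inference_batch_size=64)
print(kwargs["score_field"], kwargs["model_inference_batch_size"])
```

Calling `build_classifier(score_field="content_score")` would then return a configured instance on a machine where the package is available. How that instance is attached to a pipeline is outside the scope of this section.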

name
nemo_curator.stages.text.classifiers.content_type.CONTENT_TYPE_MODEL_IDENTIFIER = 'nvidia/content-type-classifier-deberta'
nemo_curator.stages.text.classifiers.content_type.MAX_SEQ_LENGTH = 1024
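As an illustration of how these values relate to the rest of the page, the sketch below mirrors the two constants locally instead of importing them. It shows that the model identifier is the Hugging Face repository path from the class description. The `truncate_chars` helper is hypothetical and assumes `max_chars` keeps only the leading characters of each document before tokenization; this section does not state that behavior.

```python
# Values mirrored from the documented module constants.
CONTENT_TYPE_MODEL_IDENTIFIER = "nvidia/content-type-classifier-deberta"
MAX_SEQ_LENGTH = 1024

# Default of the max_chars constructor parameter.
DEFAULT_MAX_CHARS = 6000


def model_url(identifier: str) -> str:
    """Build the Hugging Face model page URL from a repository identifier."""
    return f"https://huggingface.co/{identifier}"


def truncate_chars(text: str, max_chars: int = DEFAULT_MAX_CHARS) -> str:
    """Assumed behavior: keep only the first max_chars characters of a document."""
    return text[:max_chars]


print(model_url(CONTENT_TYPE_MODEL_IDENTIFIER))
print(len(truncate_chars("a" * 10_000)))
```

Under that assumption, a 10,000-character document would be cut to 6,000 characters, after which the tokenizer would presumably cap the sequence at `MAX_SEQ_LENGTH` tokens.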