nemo_curator.stages.text.classifiers.domain

Module Contents

Classes

| Name | Description |
| --- | --- |
| DomainClassifier | DomainClassifier is a specialized classifier designed for English text domain classification tasks. |
| MultilingualDomainClassifier | MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks. |

Data

DOMAIN_MODEL_IDENTIFIER

MAX_SEQ_LENGTH

MULTILINGUAL_DOMAIN_MODEL_IDENTIFIER

API

class nemo_curator.stages.text.classifiers.domain.DomainClassifier(
cache_dir: str | None = None,
label_field: str = 'domain_pred',
score_field: str | None = None,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int = 2000,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)

Bases: DistributedDataClassifier

DomainClassifier is a specialized classifier for English text domain classification. It uses the NemoCurator Domain Classifier model (https://huggingface.co/nvidia/domain-classifier) and is optimized for multi-node, multi-GPU setups, enabling fast and efficient inference on large datasets.
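As a minimal sketch of how the constructor might be configured, the snippet below checks a set of keyword arguments against the parameter names in the signature above before building the classifier. The helper names `check_kwargs` and `make_domain_classifier`, the `domain_score` column name, and the `Science` and `Health` labels are illustrative assumptions, not part of the library. It also assumes that `filter_by` restricts the output to rows whose predicted label is in the given list. NeMo Curator is imported only inside `make_domain_classifier`, so the validation helper works without it.

```python
# Parameter names copied from the DomainClassifier signature documented above.
DOCUMENTED_PARAMS = frozenset({
    "cache_dir", "label_field", "score_field", "text_field", "filter_by",
    "max_chars", "sort_by_length", "model_inference_batch_size",
    "autocast", "keep_tokens", "use_existing_tokens",
})


def check_kwargs(kwargs: dict) -> dict:
    """Reject keyword arguments that the documented signature does not list."""
    unknown = sorted(set(kwargs) - DOCUMENTED_PARAMS)
    if unknown:
        raise TypeError(f"unexpected keyword arguments: {unknown}")
    return dict(kwargs)


def make_domain_classifier(**overrides):
    """Build a DomainClassifier; NeMo Curator is imported only when called."""
    from nemo_curator.stages.text.classifiers.domain import DomainClassifier

    settings = {
        "score_field": "domain_score",       # hypothetical output column name
        "filter_by": ["Science", "Health"],  # assumed label names; check the model card
        "model_inference_batch_size": 128,   # illustrative; the documented default is 256
    }
    settings.update(overrides)
    return DomainClassifier(**check_kwargs(settings))
```

Calling `make_domain_classifier(max_chars=1000)` would then return a configured instance on a machine with NeMo Curator installed, while a misspelled argument fails early with a `TypeError` from `check_kwargs`.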

name = format_name_with_suffix(DOMAIN_MODEL_IDENTIFIER)

class nemo_curator.stages.text.classifiers.domain.MultilingualDomainClassifier(
cache_dir: str | None = None,
label_field: str = 'multilingual_domain_pred',
score_field: str | None = None,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int = 2000,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)

Bases: DistributedDataClassifier

MultilingualDomainClassifier is a specialized classifier for domain classification across 52 languages. It uses the NemoCurator Multilingual Domain Classifier model (https://huggingface.co/nvidia/multilingual-domain-classifier) and is optimized for multi-node, multi-GPU setups, enabling fast and efficient inference on large datasets.
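Because the two classes share a constructor signature and differ only in their default `label_field`, one way to choose between them is by the language of the input. The sketch below is an assumption-laden illustration: `choose_classifier` and `build_classifier` are hypothetical helpers, the language code is assumed to be an ISO 639-1 string supplied by the caller, and the code does not check whether a given language is among the 52 the multilingual model supports.

```python
import importlib

MODULE_PATH = "nemo_curator.stages.text.classifiers.domain"

# Class name -> default label_field, as listed in the signatures above.
DEFAULT_LABEL_FIELDS = {
    "DomainClassifier": "domain_pred",
    "MultilingualDomainClassifier": "multilingual_domain_pred",
}


def choose_classifier(language_code: str) -> str:
    """Return the class name to use for a (hypothetical) ISO 639-1 language code."""
    if language_code.strip().lower() == "en":
        return "DomainClassifier"
    return "MultilingualDomainClassifier"


def build_classifier(language_code: str, **kwargs):
    """Instantiate the chosen classifier; requires NeMo Curator to be installed."""
    module = importlib.import_module(MODULE_PATH)
    return getattr(module, choose_classifier(language_code))(**kwargs)
```

Since the default label columns differ (`domain_pred` versus `multilingual_domain_pred`), both classifiers could in principle write predictions for the same records without overwriting each other's output column.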

name

nemo_curator.stages.text.classifiers.domain.DOMAIN_MODEL_IDENTIFIER = 'nvidia/domain-classifier'

nemo_curator.stages.text.classifiers.domain.MAX_SEQ_LENGTH = 512

nemo_curator.stages.text.classifiers.domain.MULTILINGUAL_DOMAIN_MODEL_IDENTIFIER = 'nvidia/multilingual-domain-classifier'