`stages.text.classifiers.domain`#

Module Contents#

Classes#

DomainClassifier

DomainClassifier is a specialized classifier designed for English text domain classification tasks, utilizing the NemoCurator Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

MultilingualDomainClassifier

MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NemoCurator Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Data#

`DOMAIN_MODEL_IDENTIFIER`
`MAX_SEQ_LENGTH`
`MULTILINGUAL_DOMAIN_MODEL_IDENTIFIER`

API#

stages.text.classifiers.domain.DOMAIN_MODEL_IDENTIFIER#: ‘nvidia/domain-classifier’

class stages.text.classifiers.domain.DomainClassifier( cache_dir: str | None = None, label_field: str = 'domain_pred', score_field: str | None = None, text_field: str = 'text', filter_by: list[str] | None = None, max_chars: int = 2000, sort_by_length: bool = True, model_inference_batch_size: int = 256, autocast: bool = True, )#

Bases: stages.text.classifiers.base.DistributedDataClassifier

DomainClassifier is a specialized classifier designed for English text domain classification tasks, utilizing the NemoCurator Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Attributes: cache_dir: The Hugging Face cache directory. Defaults to None. label_field: The name of the prediction column. Defaults to “domain_pred”. score_field: The name of the probability column. Defaults to None. text_field: The name of the text field in the input data. Defaults to “text”. filter_by: For categorical classifiers, the list of labels to filter the data by. Defaults to None. max_chars: The maximum number of characters to use from the input text. Defaults to 2000. sort_by_length: Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True. model_inference_batch_size: The size of the batch for model inference. Defaults to 256. autocast: Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

Initialization

stages.text.classifiers.domain.MAX_SEQ_LENGTH#: 512

stages.text.classifiers.domain.MULTILINGUAL_DOMAIN_MODEL_IDENTIFIER#: ‘nvidia/multilingual-domain-classifier’

class stages.text.classifiers.domain.MultilingualDomainClassifier( cache_dir: str | None = None, label_field: str = 'multilingual_domain_pred', score_field: str | None = None, text_field: str = 'text', filter_by: list[str] | None = None, max_chars: int = 2000, sort_by_length: bool = True, model_inference_batch_size: int = 256, autocast: bool = True, )#

Bases: stages.text.classifiers.base.DistributedDataClassifier

MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NemoCurator Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Attributes: cache_dir: The Hugging Face cache directory. Defaults to None. label_field: The name of the prediction column. Defaults to “multilingual_domain_pred”. score_field: The name of the probability column. Defaults to None. text_field: The name of the text field in the input data. Defaults to “text”. filter_by: For categorical classifiers, the list of labels to filter the data by. Defaults to None. max_chars: The maximum number of characters to use from the input text. Defaults to 2000. sort_by_length: Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True. model_inference_batch_size: The size of the batch for model inference. Defaults to 256. autocast: Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

Initialization

stages.text.classifiers.domain#

Module Contents#

Classes#

Data#

API#

`stages.text.classifiers.domain`#