classifiers.domain#

Module Contents#

Classes#

DomainClassifier

DomainClassifier is a specialized classifier designed for English text domain classification tasks, utilizing the NemoCurator Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

DomainModel

DomainModelConfig

MultilingualDomainClassifier

MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NemoCurator Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Data#

API#

classifiers.domain.DOMAIN_BASE_MODEL#

‘microsoft/deberta-v3-base’

classifiers.domain.DOMAIN_IDENTIFIER#

‘nvidia/domain-classifier’

class classifiers.domain.DomainClassifier(
filter_by: list[str] | None = None,
batch_size: int = 256,
text_field: str = 'text',
pred_column: str = 'domain_pred',
prob_column: str | None = None,
max_chars: int = 2000,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)#

Bases: classifiers.domain._DomainClassifier

DomainClassifier is a specialized classifier designed for English text domain classification tasks, utilizing the NemoCurator Domain Classifier (https://huggingface.co/nvidia/domain-classifier) model. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Attributes:

filter_by (list[str], optional): The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
batch_size (int): The number of samples per batch for inference. Defaults to 256.
text_field (str): The field in the dataset that should be classified. Defaults to "text".
pred_column (str): The column name where predictions will be stored. Defaults to "domain_pred".
prob_column (str, optional): The column name where prediction probabilities will be stored. Defaults to None.
max_chars (int): The maximum number of characters in each document to consider for classification. Defaults to 2000.
device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Initialization

Constructs a Module

Args:

input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
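The effect of the max_chars and filter_by parameters above can be illustrated with a small pure-Python sketch. This is only an emulation of the documented behavior, not the library's implementation (the real classifier applies these steps to Dask/cuDF dataframes around GPU inference, and the predictions below are made up for illustration):

```python
# Toy stand-in for two documents in a dataset; the default text_field is "text".
docs = [{"text": "x" * 5000}, {"text": "short document"}]

# Preprocessing: only the first max_chars characters of each document
# are considered for classification (default 2000).
max_chars = 2000
inputs = [d["text"][:max_chars] for d in docs]
print(max(len(t) for t in inputs))  # 2000

# Postprocessing: with filter_by set, rows whose predicted label is not
# in the list are dropped. Hypothetical predictions for illustration:
preds = ["Games", "News"]
filter_by = ["Games", "Sports"]
kept = [d for d, p in zip(docs, preds) if p in filter_by]
print(len(kept))  # 1
```

With filter_by left as None, no rows would be dropped and every document would keep its prediction in pred_column.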

class classifiers.domain.DomainModel(
config: classifiers.domain.DomainModelConfig,
autocast: bool = False,
max_mem_gb: int | None = None,
)#

Bases: crossfit.backend.torch.hf.model.HFModel

Initialization

load_config() → transformers.AutoConfig#

load_model(
device: str = 'cuda',
) → nemo_curator.classifiers.base.HFDeberta#

load_tokenizer() → transformers.AutoTokenizer#

class classifiers.domain.DomainModelConfig#

base_model: str#

None

fc_dropout: float#

0.2

identifier: str#

None

max_len: int#

512
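The fields above can be gathered into a small dataclass sketch. This is a hypothetical re-creation from the documented fields, not the library's actual definition; base_model and identifier default to None here and are presumably populated from the DOMAIN_BASE_MODEL / DOMAIN_IDENTIFIER (or MULTILINGUAL_*) constants when a classifier is constructed:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DomainModelConfig:
    # Defaults mirror the field values documented above.
    base_model: Optional[str] = None   # e.g. "microsoft/deberta-v3-base"
    fc_dropout: float = 0.2            # dropout on the classification head
    identifier: Optional[str] = None   # e.g. "nvidia/domain-classifier"
    max_len: int = 512                 # tokenizer max sequence length

cfg = DomainModelConfig(
    base_model="microsoft/deberta-v3-base",
    identifier="nvidia/domain-classifier",
)
print(cfg.max_len)  # 512
```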

classifiers.domain.MULTILINGUAL_DOMAIN_BASE_MODEL#

‘microsoft/mdeberta-v3-base’

classifiers.domain.MULTILINGUAL_DOMAIN_IDENTIFIER#

‘nvidia/multilingual-domain-classifier’

class classifiers.domain.MultilingualDomainClassifier(
filter_by: list[str] | None = None,
batch_size: int = 256,
text_field: str = 'text',
pred_column: str = 'domain_pred',
prob_column: str | None = None,
max_chars: int = 2000,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)#

Bases: classifiers.domain._DomainClassifier

MultilingualDomainClassifier is a specialized classifier designed for domain classification tasks, utilizing the NemoCurator Multilingual Domain Classifier (https://huggingface.co/nvidia/multilingual-domain-classifier) model. It supports domain classification across 52 languages. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large datasets.

Attributes:

filter_by (list[str], optional): The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
batch_size (int): The number of samples per batch for inference. Defaults to 256.
text_field (str): The field in the dataset that should be classified. Defaults to "text".
pred_column (str): The column name where predictions will be stored. Defaults to "domain_pred".
prob_column (str, optional): The column name where prediction probabilities will be stored. Defaults to None.
max_chars (int): The maximum number of characters in each document to consider for classification. Defaults to 2000.
device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.

Initialization

Constructs a Module

Args:

input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
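The batch_size parameter shared by both classifiers controls how many truncated documents are fed to the model per forward pass. A minimal pure-Python sketch of that chunking (the helper name is hypothetical; the real pipeline batches dataframe partitions on the GPU):

```python
def iter_batches(docs, batch_size=256):
    """Yield successive batch_size-sized slices of docs."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]

# 1000 documents with the default batch_size of 256 produce three full
# batches and one remainder batch.
batches = list(iter_batches(list(range(1000)), batch_size=256))
print([len(b) for b in batches])  # [256, 256, 256, 232]
```

Larger batch sizes improve GPU utilization at the cost of memory; max_mem_gb bounds the model's allocation independently of this setting.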