classifiers.quality
Module Contents
Classes
QualityClassifier is a specialized classifier for quality assessment tasks, built on the NemoCurator Quality Classifier DeBERTa model (https://huggingface.co/nvidia/quality-classifier-deberta). It is optimized for multi-node, multi-GPU setups to enable fast, efficient inference on large datasets.
Data
API
- classifiers.quality.QUALITY_IDENTIFIER
'nvidia/quality-classifier-deberta'
- class classifiers.quality.QualityClassifier(
- filter_by: list[str] | None = None,
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'quality_pred',
- prob_column: str = 'quality_prob',
- max_chars: int = 6000,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
Bases:
nemo_curator.classifiers.base.DistributedDataClassifier
QualityClassifier is a specialized classifier for quality assessment tasks, built on the NemoCurator Quality Classifier DeBERTa model (https://huggingface.co/nvidia/quality-classifier-deberta). It is optimized for multi-node, multi-GPU setups to enable fast, efficient inference on large datasets.
Attributes:
- filter_by (list[str], optional): The classes to filter the dataset by. If None, all classes will be included. Defaults to None.
- batch_size (int): The number of samples per batch for inference. Defaults to 256.
- text_field (str): The field in the dataset that should be classified.
- pred_column (str): The column name where predictions will be stored. Defaults to "quality_pred".
- prob_column (str): The column name where prediction probabilities will be stored. Defaults to "quality_prob".
- max_chars (int): The maximum number of characters in each document to consider for classification. Defaults to 6000.
- device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
- autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
- max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
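To make the max_chars attribute concrete, the snippet below sketches the kind of truncation step it implies: each document is capped at max_chars characters before tokenization. The truncate_text helper and the pandas frame are illustrative assumptions, not the library's actual internals, and the exact truncation strategy (e.g. leading vs. trailing characters) may differ.

```python
import pandas as pd

def truncate_text(text: str, max_chars: int = 6000) -> str:
    # Cap each document at max_chars characters; longer documents
    # are cut off before being passed to the tokenizer.
    return text[:max_chars]

# Two documents: one well under the limit, one far over it.
df = pd.DataFrame({"text": ["a short document", "x" * 10_000]})
df["text"] = df["text"].apply(truncate_text)
```

After this step, no document in the frame exceeds 6000 characters, which bounds per-sample tokenization and inference cost.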
Initialization
Constructs a Module
Args:
- input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
- name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
- class classifiers.quality.QualityModel(
- config: classifiers.quality.QualityModelConfig,
- autocast: bool = False,
- max_mem_gb: int | None = None,
- )
Bases:
crossfit.backend.torch.hf.model.HFModel
Initialization
- load_config() → transformers.AutoConfig
- load_model(
- device: str = 'cuda',
- )
- load_tokenizer() → transformers.AutoTokenizer