classifiers.fineweb_edu
Module Contents
Classes
| FineWebEduClassifier | A classifier for educational content assessment that uses the Hugging Face FineWeb-Edu Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). It is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets. |
| FineWebMixtralEduClassifier | A classifier for educational content assessment that uses the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Mixtral 8x22B-Instruct. It is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets. |
| FineWebNemotronEduClassifier | A classifier for educational content assessment that uses the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Nemotron-4-340B-Instruct. It is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets. |
Data
API
- classifiers.fineweb_edu.FINEWEB_EDU_IDENTIFIER
- 'HuggingFaceFW/fineweb-edu-classifier'
- classifiers.fineweb_edu.FINEWEB_MIXTRAL_IDENTIFIER
- 'nvidia/nemocurator-fineweb-mixtral-edu-classifier'
- classifiers.fineweb_edu.FINEWEB_NEMOTRON_IDENTIFIER
- 'nvidia/nemocurator-fineweb-nemotron-4-edu-classifier'
- class classifiers.fineweb_edu.FineWebEduClassifier(
- batch_size: int = 256,
- text_field: str = 'text',
- pred_column: str = 'fineweb-edu-score',
- int_column: str = 'fineweb-edu-score-int',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
- Bases: classifiers.fineweb_edu._FineWebBaseClassifier
- FineWebEduClassifier is a classifier for educational content assessment that uses the Hugging Face FineWeb-Edu Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). It is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- Attributes:
  - batch_size (int): The number of samples per batch for inference. Defaults to 256.
  - text_field (str): The column name containing the text data to be classified. Defaults to "text".
  - pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-edu-score".
  - int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-edu-score-int".
  - max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
  - device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
  - autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
  - max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, defaults to the available GPU memory minus 4 GB.
- Initialization
- Constructs a Module.
- Args:
  - input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
  - name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
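The relationship between max_chars, pred_column, and int_column described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names truncate_text and to_int_score are hypothetical, and clamping the score to the 0 to 5 range before rounding is an assumption based on the scoring scale of the upstream model.

```python
def truncate_text(text: str, max_chars: int = -1) -> str:
    """Keep the first max_chars characters; a negative value keeps the whole text."""
    if max_chars < 0:
        return text
    return text[:max_chars]


def to_int_score(score: float, low: int = 0, high: int = 5) -> int:
    """Round a float score to an integer, clamped to [low, high] (assumed range)."""
    clamped = max(low, min(score, high))
    # Note: Python's round() resolves exact .5 ties to the nearest even integer;
    # the library may resolve ties differently.
    return int(round(clamped))


# Hypothetical record using the default column names from the signature above.
row = {
    "text": "Photosynthesis converts light into chemical energy.",
    "fineweb-edu-score": 3.27,
}
row["fineweb-edu-score-int"] = to_int_score(row["fineweb-edu-score"])
```

With max_chars left at -1, truncate_text returns the document unchanged, which matches the documented default behavior.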
- class classifiers.fineweb_edu.FineWebMixtralEduClassifier(
- batch_size: int = 1024,
- text_field: str = 'text',
- pred_column: str = 'fineweb-mixtral-edu-score',
- int_column: str = 'fineweb-mixtral-edu-score-int',
- quality_label_column: str = 'fineweb-mixtral-edu-score-label',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
- Bases: classifiers.fineweb_edu._FineWebBaseClassifier
- FineWebMixtralEduClassifier is a classifier for educational content assessment that uses the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Mixtral 8x22B-Instruct. It is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- Attributes:
  - batch_size (int): The number of samples per batch for inference. Defaults to 1024.
  - text_field (str): The column name containing the text data to be classified. Defaults to "text".
  - pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score".
  - int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score-int".
  - quality_label_column (str): The column name where the quality label will be stored: "high_quality" for scores >= 2.5, otherwise "low_quality". Defaults to "fineweb-mixtral-edu-score-label".
  - max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
  - device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
  - autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
  - max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, defaults to the available GPU memory minus 4 GB.
- Initialization
- Constructs a Module.
- Args:
  - input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
  - name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
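The labeling rule for quality_label_column (scores >= 2.5 are "high_quality", everything else is "low_quality") can be sketched as follows. The function name quality_label and the sample scores are hypothetical; only the threshold and the two label strings come from the attribute description above.

```python
HIGH_QUALITY_THRESHOLD = 2.5  # threshold stated in the attribute description


def quality_label(score: float, threshold: float = HIGH_QUALITY_THRESHOLD) -> str:
    """Map a float score to the label stored in quality_label_column."""
    return "high_quality" if score >= threshold else "low_quality"


# Made-up scores for illustration; 2.5 itself falls on the high-quality side.
scores = [0.8, 2.5, 4.1]
labels = [quality_label(s) for s in scores]
```

Because the comparison is inclusive, a document scoring exactly 2.5 is labeled "high_quality".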
- class classifiers.fineweb_edu.FineWebNemotronEduClassifier(
- batch_size: int = 1024,
- text_field: str = 'text',
- pred_column: str = 'fineweb-nemotron-edu-score',
- int_column: str = 'fineweb-nemotron-edu-score-int',
- quality_label_column: str = 'fineweb-nemotron-edu-score-label',
- max_chars: int = -1,
- device_type: str = 'cuda',
- autocast: bool = True,
- max_mem_gb: int | None = None,
- )
- Bases: classifiers.fineweb_edu._FineWebBaseClassifier
- FineWebNemotronEduClassifier is a classifier for educational content assessment that uses the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but with annotations from Nemotron-4-340B-Instruct. It is optimized for multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
- Attributes:
  - batch_size (int): The number of samples per batch for inference. Defaults to 1024.
  - text_field (str): The column name containing the text data to be classified. Defaults to "text".
  - pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score".
  - int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score-int".
  - quality_label_column (str): The column name where the quality label will be stored: "high_quality" for scores >= 2.5, otherwise "low_quality". Defaults to "fineweb-nemotron-edu-score-label".
  - max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
  - device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
  - autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
  - max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, defaults to the available GPU memory minus 4 GB.
- Initialization
- Constructs a Module.
- Args:
  - input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
  - name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
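The documented default for max_mem_gb (available GPU memory minus 4 GB when the argument is None) amounts to a small fallback rule. The sketch below is an assumption about how that rule could look: resolve_max_mem_gb and available_gpu_gb are hypothetical names, and how the library actually measures available memory is not stated here.

```python
from typing import Optional

RESERVED_GB = 4  # headroom stated in the attribute description


def resolve_max_mem_gb(max_mem_gb: Optional[int], available_gpu_gb: int) -> int:
    """Return the explicit limit if given, else available memory minus the reserve."""
    if max_mem_gb is not None:
        return max_mem_gb
    return available_gpu_gb - RESERVED_GB


# Hypothetical 80 GB device with no explicit limit: 76 GB is left for the model.
default_limit = resolve_max_mem_gb(None, available_gpu_gb=80)
# An explicit value is used as-is.
explicit_limit = resolve_max_mem_gb(16, available_gpu_gb=80)
```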
- class classifiers.fineweb_edu.FinewebEduModel(
- path_or_name: str,
- max_mem_gb: int | None = None,
- autocast: bool = False,
- )
- Bases: crossfit.backend.torch.hf.model.HFModel
- Initialization
- static configure_forward(
- model: torch.nn.Module,
- autocast: bool = True,
- )
 - load_config() → transformers.AutoConfig
 - load_model(device: str = 'cuda') → torch.nn.Module
 - load_tokenizer() → transformers.AutoTokenizer