`classifiers.fineweb_edu`

Module Contents

Classes

`FineWebEduClassifier`	FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
`FineWebMixtralEduClassifier`	FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
`FineWebNemotronEduClassifier`	FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
`FinewebEduModel`

Data

`FINEWEB_EDU_IDENTIFIER`
`FINEWEB_MIXTRAL_IDENTIFIER`
`FINEWEB_NEMOTRON_IDENTIFIER`

API

classifiers.fineweb_edu.FINEWEB_EDU_IDENTIFIER: 'HuggingFaceFW/fineweb-edu-classifier'

classifiers.fineweb_edu.FINEWEB_MIXTRAL_IDENTIFIER: 'nvidia/nemocurator-fineweb-mixtral-edu-classifier'

classifiers.fineweb_edu.FINEWEB_NEMOTRON_IDENTIFIER: 'nvidia/nemocurator-fineweb-nemotron-4-edu-classifier'

class classifiers.fineweb_edu.FineWebEduClassifier(batch_size: int = 256, text_field: str = 'text', pred_column: str = 'fineweb-edu-score', int_column: str = 'fineweb-edu-score-int', max_chars: int = -1, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None)

Bases: classifiers.fineweb_edu._FineWebBaseClassifier

FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:
    batch_size (int): The number of samples per batch for inference. Defaults to 256.
    text_field (str): The column name containing the text data to be classified. Defaults to "text".
    pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-edu-score".
    int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-edu-score-int".
    max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
    device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
    autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
    max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
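The relationship between `max_chars`, `pred_column`, and `int_column` can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names (`truncate_text`, `to_int_score`, `add_score_columns`) are hypothetical, the scores are placeholders rather than model output, and clamping to the 0 to 5 range before rounding is an assumption about how the integer column is derived.

```python
from typing import Dict, List


def truncate_text(text: str, max_chars: int = -1) -> str:
    """Keep only the first max_chars characters; -1 keeps the whole document."""
    return text if max_chars < 0 else text[:max_chars]


def to_int_score(score: float) -> int:
    """Clamp a raw score to [0, 5] (assumed range), then round to an integer."""
    return int(round(max(0.0, min(score, 5.0))))


def add_score_columns(
    rows: List[Dict[str, object]],
    scores: List[float],
    pred_column: str = "fineweb-edu-score",
    int_column: str = "fineweb-edu-score-int",
) -> List[Dict[str, object]]:
    """Return copies of rows with the float and integer score columns added."""
    out = []
    for row, score in zip(rows, scores):
        new_row = dict(row)
        new_row[pred_column] = score
        new_row[int_column] = to_int_score(score)
        out.append(new_row)
    return out


docs = [{"text": "Cells divide by mitosis."}, {"text": "Buy now!!!"}]
placeholder_scores = [3.7, -0.3]  # stand-ins for model predictions
scored = add_score_columns(docs, placeholder_scores)
```

Each row in `scored` keeps its original `text` and gains one float column and one integer column under the default names listed above.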

Initialization

Constructs a Module.

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.
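A typical invocation might look like the sketch below. It rests on several assumptions that this page does not confirm: that the module is distributed as part of the `nemo_curator` package, that `DocumentDataset.read_json`, `get_client`, and `DocumentDataset.to_json` are available at the import paths shown, and that the classifier instance is called with a `dataset` keyword. The function name `score_with_fineweb_edu` and its two path arguments are hypothetical.

```python
def score_with_fineweb_edu(input_path: str, output_path: str) -> None:
    """Score JSONL documents with FineWebEduClassifier and write the results."""
    # Imports are deferred so that defining this function does not require a
    # GPU environment. The `nemo_curator` import paths are assumptions.
    from nemo_curator.classifiers import FineWebEduClassifier
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.utils.distributed_utils import get_client

    # Start (or connect to) a Dask cluster backed by GPUs.
    client = get_client(cluster_type="gpu")

    # The classifier is assumed to expect a cuDF-backed dataset.
    dataset = DocumentDataset.read_json(input_path, backend="cudf")

    # Only parameters documented above are passed; the values are the defaults.
    classifier = FineWebEduClassifier(batch_size=256, text_field="text", max_chars=-1)

    # Adds the "fineweb-edu-score" and "fineweb-edu-score-int" columns.
    scored = classifier(dataset=dataset)
    scored.to_json(output_path)

    client.close()
```

Under the same assumptions, `FineWebMixtralEduClassifier` or `FineWebNemotronEduClassifier` could be substituted for the class name, since all three share the same base class.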

class classifiers.fineweb_edu.FineWebMixtralEduClassifier(batch_size: int = 1024, text_field: str = 'text', pred_column: str = 'fineweb-mixtral-edu-score', int_column: str = 'fineweb-mixtral-edu-score-int', quality_label_column: str = 'fineweb-mixtral-edu-score-label', max_chars: int = -1, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None)

Bases: classifiers.fineweb_edu._FineWebBaseClassifier

FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:
    batch_size (int): The number of samples per batch for inference. Defaults to 1024.
    text_field (str): The column name containing the text data to be classified. Defaults to "text".
    pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score".
    int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score-int".
    quality_label_column (str): The column name where the quality label will be stored. A score of >= 2.5 is labeled "high_quality"; any lower score is labeled "low_quality". Defaults to "fineweb-mixtral-edu-score-label".
    max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
    device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
    autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
    max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
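The labeling rule described for `quality_label_column` can be written as a one-line function. This is a sketch of the documented threshold only; the function name `quality_label` and the `threshold` parameter are hypothetical, and whether the library compares against the raw float score (as assumed here) or a post-processed value is not stated on this page.

```python
def quality_label(score: float, threshold: float = 2.5) -> str:
    """Map a score to the label stored in quality_label_column."""
    # Scores at or above the threshold count as high quality.
    return "high_quality" if score >= threshold else "low_quality"


labels = [quality_label(s) for s in (0.8, 2.49, 2.5, 4.1)]
```

Note that the comparison is inclusive: a score of exactly 2.5 is labeled "high_quality".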

Initialization

Constructs a Module.

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.

class classifiers.fineweb_edu.FineWebNemotronEduClassifier(batch_size: int = 1024, text_field: str = 'text', pred_column: str = 'fineweb-nemotron-edu-score', int_column: str = 'fineweb-nemotron-edu-score-int', quality_label_column: str = 'fineweb-nemotron-edu-score-label', max_chars: int = -1, device_type: str = 'cuda', autocast: bool = True, max_mem_gb: int | None = None)

Bases: classifiers.fineweb_edu._FineWebBaseClassifier

FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:
    batch_size (int): The number of samples per batch for inference. Defaults to 1024.
    text_field (str): The column name containing the text data to be classified. Defaults to "text".
    pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score".
    int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score-int".
    quality_label_column (str): The column name where the quality label will be stored. A score of >= 2.5 is labeled "high_quality"; any lower score is labeled "low_quality". Defaults to "fineweb-nemotron-edu-score-label".
    max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
    device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
    autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
    max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
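The `max_mem_gb` fallback described above amounts to simple arithmetic, sketched here for illustration. The helper `resolve_max_mem_gb` and its parameters are hypothetical; how the library actually queries the available GPU memory is not shown on this page, so the sketch takes that figure as an input.

```python
from typing import Optional


def resolve_max_mem_gb(
    max_mem_gb: Optional[int],
    available_gpu_mem_gb: int,
    reserve_gb: int = 4,
) -> int:
    """Return the memory budget in GB, applying the documented default."""
    if max_mem_gb is not None:
        # An explicit value is used as given.
        return max_mem_gb
    # Documented default: available GPU memory minus 4 GB.
    return available_gpu_mem_gb - reserve_gb


default_budget = resolve_max_mem_gb(None, available_gpu_mem_gb=80)
explicit_budget = resolve_max_mem_gb(16, available_gpu_mem_gb=80)
```

On a device reporting 80 GB available, the default budget in this sketch is 76 GB, while an explicit `max_mem_gb=16` is passed through unchanged.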

Initialization

Constructs a Module.

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.

class classifiers.fineweb_edu.FinewebEduModel(path_or_name: str, max_mem_gb: int | None = None, autocast: bool = False)

Bases: crossfit.backend.torch.hf.model.HFModel

Initialization

static configure_forward(model: torch.nn.Module, autocast: bool = True) → torch.nn.Module

load_config() → transformers.AutoConfig

load_model(device: str = 'cuda') → torch.nn.Module

load_tokenizer() → transformers.AutoTokenizer
