classifiers.fineweb_edu

Module Contents

Classes

FineWebEduClassifier

FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

FineWebMixtralEduClassifier

FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

FineWebNemotronEduClassifier

FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

FinewebEduModel

Data

API

classifiers.fineweb_edu.FINEWEB_EDU_IDENTIFIER

'HuggingFaceFW/fineweb-edu-classifier'

classifiers.fineweb_edu.FINEWEB_MIXTRAL_IDENTIFIER

'nvidia/nemocurator-fineweb-mixtral-edu-classifier'

classifiers.fineweb_edu.FINEWEB_NEMOTRON_IDENTIFIER

'nvidia/nemocurator-fineweb-nemotron-4-edu-classifier'

class classifiers.fineweb_edu.FineWebEduClassifier(
batch_size: int = 256,
text_field: str = 'text',
pred_column: str = 'fineweb-edu-score',
int_column: str = 'fineweb-edu-score-int',
max_chars: int = -1,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)

Bases: classifiers.fineweb_edu._FineWebBaseClassifier

FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:
    batch_size (int): The number of samples per batch for inference. Defaults to 256.
    text_field (str): The column name containing the text data to be classified. Defaults to "text".
    pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-edu-score".
    int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-edu-score-int".
    max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
    device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
    autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
    max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
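
The effect of max_chars and int_column can be illustrated with a minimal, library-free sketch. The helper names truncate_text and to_int_score are hypothetical and are not part of this module. Clamping to the 0-5 range before rounding is an assumption borrowed from the upstream model card; this page only states that the score is integer-rounded.

```python
def truncate_text(text: str, max_chars: int = -1) -> str:
    """Keep only the first max_chars characters of a document.

    -1 keeps the whole document, as described for max_chars.
    Treating every negative value as "no limit" is a choice made for this sketch.
    """
    return text if max_chars < 0 else text[:max_chars]


def to_int_score(score: float, low: int = 0, high: int = 5) -> int:
    """Clamp a raw score to [low, high], then round to the nearest integer.

    Assumption: the 0-5 range and the clamping step are not stated on this
    page; they follow the upstream FineWeb-Edu model card.
    """
    clamped = max(low, min(score, high))
    return int(round(clamped))


# Made-up sample values, for illustration only.
truncated = truncate_text("educational content", max_chars=11)
int_scores = [to_int_score(s) for s in (3.7, -0.3, 6.2)]
```

In this sketch the raw score would go to pred_column and the rounded value to int_column.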

Initialization

Constructs a Module

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.

class classifiers.fineweb_edu.FineWebMixtralEduClassifier(
batch_size: int = 1024,
text_field: str = 'text',
pred_column: str = 'fineweb-mixtral-edu-score',
int_column: str = 'fineweb-mixtral-edu-score-int',
quality_label_column: str = 'fineweb-mixtral-edu-score-label',
max_chars: int = -1,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)

Bases: classifiers.fineweb_edu._FineWebBaseClassifier

FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:
    batch_size (int): The number of samples per batch for inference. Defaults to 1024.
    text_field (str): The column name containing the text data to be classified. Defaults to "text".
    pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score".
    int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-mixtral-edu-score-int".
    quality_label_column (str): The column name where the quality label will be stored. A score >= 2.5 is labeled "high_quality"; any other score is labeled "low_quality". Defaults to "fineweb-mixtral-edu-score-label".
    max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
    device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
    autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
    max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
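
The labeling rule described for quality_label_column can be sketched in plain Python as follows. The quality_label helper and the sample rows are hypothetical and only illustrate the 2.5 threshold; they are not the classifier's actual implementation.

```python
SCORE_COLUMN = "fineweb-mixtral-edu-score"        # default pred_column
LABEL_COLUMN = "fineweb-mixtral-edu-score-label"  # default quality_label_column
THRESHOLD = 2.5                                   # threshold stated in the docstring


def quality_label(score: float, threshold: float = THRESHOLD) -> str:
    """Return "high_quality" for scores >= threshold, else "low_quality"."""
    return "high_quality" if score >= threshold else "low_quality"


# Made-up rows and scores, for illustration only.
rows = [
    {"text": "Photosynthesis converts light into chemical energy.", SCORE_COLUMN: 3.4},
    {"text": "Click here to win a prize!", SCORE_COLUMN: 0.6},
    {"text": "A borderline document.", SCORE_COLUMN: 2.5},
]

for row in rows:
    row[LABEL_COLUMN] = quality_label(row[SCORE_COLUMN])

labels = [row[LABEL_COLUMN] for row in rows]
```

Because the comparison is inclusive, a score of exactly 2.5 is labeled "high_quality".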

Initialization

Constructs a Module

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.

class classifiers.fineweb_edu.FineWebNemotronEduClassifier(
batch_size: int = 1024,
text_field: str = 'text',
pred_column: str = 'fineweb-nemotron-edu-score',
int_column: str = 'fineweb-nemotron-edu-score-int',
quality_label_column: str = 'fineweb-nemotron-edu-score-label',
max_chars: int = -1,
device_type: str = 'cuda',
autocast: bool = True,
max_mem_gb: int | None = None,
)

Bases: classifiers.fineweb_edu._FineWebBaseClassifier

FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:
    batch_size (int): The number of samples per batch for inference. Defaults to 1024.
    text_field (str): The column name containing the text data to be classified. Defaults to "text".
    pred_column (str): The column name where prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score".
    int_column (str): The column name where integer-rounded prediction scores will be stored. Defaults to "fineweb-nemotron-edu-score-int".
    quality_label_column (str): The column name where the quality label will be stored. A score >= 2.5 is labeled "high_quality"; any other score is labeled "low_quality". Defaults to "fineweb-nemotron-edu-score-label".
    max_chars (int): The maximum number of characters in each document to consider for classification. If -1, the entire document is considered. Defaults to -1.
    device_type (str): The type of device to use for inference, either "cuda" or "cpu". Defaults to "cuda".
    autocast (bool): Whether to use mixed precision for faster inference. Defaults to True.
    max_mem_gb (int, optional): The maximum amount of memory in GB to allocate for the model. If None, it defaults to the available GPU memory minus 4 GB.
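
The max_mem_gb fallback can be read as the small rule sketched here. The function name resolve_max_mem_gb and its available_gpu_mem_gb parameter are hypothetical; how the classifier actually queries GPU memory is not described on this page, so the available amount is passed in explicitly.

```python
from typing import Optional

RESERVED_GB = 4  # headroom stated in the docstring ("available GPU memory minus 4 GB")


def resolve_max_mem_gb(max_mem_gb: Optional[int], available_gpu_mem_gb: int) -> int:
    """Return the memory budget in GB for the model.

    An explicit max_mem_gb is used as given; None falls back to the
    available GPU memory minus the reserved headroom.
    """
    if max_mem_gb is not None:
        return max_mem_gb
    return available_gpu_mem_gb - RESERVED_GB


# Made-up memory size, for illustration only.
default_budget = resolve_max_mem_gb(None, available_gpu_mem_gb=80)
explicit_budget = resolve_max_mem_gb(16, available_gpu_mem_gb=80)
```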

Initialization

Constructs a Module

Args:
    input_backend (Literal["pandas", "cudf", "any"]): The backend the input dataframe must be on for the module to work.
    name (str, optional): The name of the module. If None, defaults to self.__class__.__name__.

class classifiers.fineweb_edu.FinewebEduModel(
path_or_name: str,
max_mem_gb: int | None = None,
autocast: bool = False,
)

Bases: crossfit.backend.torch.hf.model.HFModel

Initialization

static configure_forward(
model: torch.nn.Module,
autocast: bool = True,
) → torch.nn.Module

load_config() → transformers.AutoConfig

load_model(device: str = 'cuda') → torch.nn.Module

load_tokenizer() → transformers.AutoTokenizer