nemo_curator.stages.text.classifiers.fineweb_edu
Module Contents
Classes
Data
FINEWEB_MIXTRAL_EDU_MODEL_IDENTIFIER
FINEWEB_NEMOTRON_EDU_MODEL_IDENTIFIER
API
FineWebEduClassifier
Bases: _FineWebBaseClassifier
FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
FineWebMixtralEduClassifier
Bases: _FineWebBaseClassifier
FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
Bases: ModelStage
Stage for Hugging Face model inference.
Parameters:
The identifier of the Hugging Face model.
The Hugging Face cache directory. Defaults to None.
The name of the prediction column.
The name of the float score column.
The name of the integer score column.
The size of the batch for model inference. Defaults to 256.
Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it groups sequences of similar length into the same batch, reducing padding and improving inference throughput. Defaults to True.
Whether to use autocast. When True, minor accuracy is traded for faster inference. Defaults to True.
Whether to keep the input tokens in the output dataframe. Defaults to False.
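The sort-by-length option above can be illustrated with a small stand-alone sketch (plain Python, not the NeMo Curator implementation). When batches are padded to their longest member, sorting documents by token count before batching groups similar lengths together, so far fewer padding tokens are processed:

```python
def batch_padding_cost(docs: list[list[int]], batch_size: int) -> int:
    """Total padded tokens when each batch is padded to its longest member."""
    total = 0
    for i in range(0, len(docs), batch_size):
        batch = docs[i:i + batch_size]
        longest = max(len(d) for d in batch)
        total += longest * len(batch)  # every sequence padded to the batch max
    return total

# Token ID sequences of varying lengths (contents are irrelevant here).
docs = [[0] * n for n in [3, 120, 7, 110, 5, 115, 9, 100]]

unsorted_cost = batch_padding_cost(docs, batch_size=4)
sorted_cost = batch_padding_cost(sorted(docs, key=len), batch_size=4)

print(unsorted_cost, sorted_cost)  # → 940 516
```

Here sorting cuts the padded-token count from 940 to 516, which is why the option defaults to True.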
FineWebNemotronEduClassifier
Bases: _FineWebBaseClassifier
FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.
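All three classifiers emit both a float score column and an integer score column, and the FineWeb-Edu family rates documents on a 0–5 educational-quality scale. One plausible derivation of the integer column (an illustrative sketch, not necessarily the exact NeMo Curator logic) is rounding the raw regression score and clamping it to that scale:

```python
def to_int_score(float_score: float, low: int = 0, high: int = 5) -> int:
    """Round a raw regression score and clamp it to the [low, high] scale."""
    return max(low, min(high, round(float_score)))

# Raw scores can fall slightly outside the nominal range.
print([to_int_score(s) for s in [-0.3, 1.4, 3.6, 5.7]])  # → [0, 1, 4, 5]
```

The clamped integer score is convenient for threshold-based filtering (e.g. keeping only documents with an integer score of 3 or higher), while the float column preserves the model's full-precision output.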
_FineWebBaseClassifier
Bases: CompositeStage[DocumentBatch, DocumentBatch]
Parent class for FineWebEduClassifier, FineWebMixtralEduClassifier, and FineWebNemotronEduClassifier, since their implementations are almost identical.
Parameters:
The identifier of the Hugging Face model.
The Hugging Face cache directory. Defaults to None.
The name of the prediction column.
The name of the float score column.
The name of the integer score column.
The name of the text field in the input data. Defaults to "text".
For categorical classifiers, the list of labels to filter the data by. Defaults to None.
Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
Limits the length of the token sequence returned by the tokenizer. Defaults to 512.
Whether to sort the input data by the length of the input tokens. Sorting is encouraged because it groups sequences of similar length into the same batch, reducing padding and improving inference throughput. Defaults to True.
The size of the batch for model inference. Defaults to 256.
Whether to use autocast. When True, minor accuracy is traded for faster inference. Defaults to True.
Whether to keep the input tokens in the output dataframe. Defaults to False.
Whether to use existing tokens from the input dataframe. If True, the relevant token fields are assumed to be ["input_ids", "attention_mask"] and tokenization is skipped. Defaults to False.