stages.text.classifiers.fineweb_edu#

Module Contents#

Classes#

FineWebEduClassifier

FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

FineWebMixtralEduClassifier

FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

FineWebModelStage

Stage for Hugging Face model inference.

FineWebNemotronEduClassifier

FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Data#

API#

stages.text.classifiers.fineweb_edu.FINEWEB_EDU_MODEL_IDENTIFIER#

'HuggingFaceFW/fineweb-edu-classifier'

stages.text.classifiers.fineweb_edu.FINEWEB_MIXTRAL_EDU_MODEL_IDENTIFIER#

'nvidia/nemocurator-fineweb-mixtral-edu-classifier'

stages.text.classifiers.fineweb_edu.FINEWEB_NEMOTRON_EDU_MODEL_IDENTIFIER#

'nvidia/nemocurator-fineweb-nemotron-4-edu-classifier'

class stages.text.classifiers.fineweb_edu.FineWebEduClassifier(
cache_dir: str | None = None,
pred_column: str = 'fineweb-edu-score-label',
float_score_column: str = 'fineweb-edu-score-float',
int_score_column: str = 'fineweb-edu-score-int',
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
)#

Bases: stages.text.classifiers.fineweb_edu._FineWebBaseClassifier

FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:

- cache_dir: The Hugging Face cache directory. Defaults to None.
- pred_column: The name of the prediction column. Defaults to "fineweb-edu-score-label".
- float_score_column: The name of the float score column. Defaults to "fineweb-edu-score-float".
- int_score_column: The name of the integer score column. Defaults to "fineweb-edu-score-int".
- text_field: The name of the text field in the input data. Defaults to "text".
- filter_by: For categorical classifiers, the list of labels to filter the data by. Defaults to None.
- max_chars: Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
- sort_by_length: Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.
- model_inference_batch_size: The size of the batch for model inference. Defaults to 256.
- autocast: Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

Initialization
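The classifier emits three related columns: a raw float score, an integer score, and a label. As a rough illustration of how the integer column could relate to the float column, here is a dependency-free sketch. The rounding-and-clamping rule and the 0-5 range (taken from the FineWeb-Edu scoring rubric) are assumptions for illustration, not the stage's actual implementation:

```python
# Hypothetical sketch: derive an integer score from the model's float score.
# The real stage computes these columns internally; the exact rounding rule
# used here is an assumption.
def int_score_from_float(score: float) -> int:
    """Round the float score and clamp it to the 0-5 rubric range."""
    return max(0, min(5, round(score)))

print(int_score_from_float(2.7))   # 3 (rounded)
print(int_score_from_float(-0.4))  # 0 (clamped to the lower bound)
print(int_score_from_float(5.6))   # 5 (clamped to the upper bound)
```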

class stages.text.classifiers.fineweb_edu.FineWebMixtralEduClassifier(
cache_dir: str | None = None,
pred_column: str = 'fineweb-mixtral-edu-score-label',
float_score_column: str = 'fineweb-mixtral-edu-score-float',
int_score_column: str = 'fineweb-mixtral-edu-score-int',
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 1024,
autocast: bool = True,
)#

Bases: stages.text.classifiers.fineweb_edu._FineWebBaseClassifier

FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:

- cache_dir: The Hugging Face cache directory. Defaults to None.
- pred_column: The name of the prediction column. Defaults to "fineweb-mixtral-edu-score-label".
- float_score_column: The name of the float score column. Defaults to "fineweb-mixtral-edu-score-float".
- int_score_column: The name of the integer score column. Defaults to "fineweb-mixtral-edu-score-int".
- text_field: The name of the text field in the input data. Defaults to "text".
- filter_by: For categorical classifiers, the list of labels to filter the data by. Defaults to None.
- max_chars: Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
- sort_by_length: Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.
- model_inference_batch_size: The size of the batch for model inference. Defaults to 1024.
- autocast: Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

Initialization
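The prediction column holds a categorical label derived from the float score. As a sketch of the idea, the snippet below maps a score to a label with a fixed threshold. The threshold value of 2.5 and the label strings are assumptions for illustration; consult the model card for the actual labeling rule:

```python
# Hypothetical sketch of deriving a quality label from the float score.
# The 2.5 threshold and label names are assumptions, not the model's
# documented behavior.
def label_from_score(score: float, threshold: float = 2.5) -> str:
    return "high_quality" if score >= threshold else "low_quality"

labels = [label_from_score(s) for s in (3.1, 0.8)]
print(labels)  # ['high_quality', 'low_quality']
```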

class stages.text.classifiers.fineweb_edu.FineWebModelStage(
model_identifier: str,
cache_dir: str | None = None,
pred_column: str = 'preds',
float_score_column: str = 'float_score',
int_score_column: str = 'int_score',
model_inference_batch_size: int = 256,
has_seq_order: bool = True,
autocast: bool = True,
)#

Bases: nemo_curator.stages.text.models.model.ModelStage

Stage for Hugging Face model inference.

Args:

- model_identifier: The identifier of the Hugging Face model.
- cache_dir: The Hugging Face cache directory. Defaults to None.
- pred_column: The name of the prediction column.
- float_score_column: The name of the float score column.
- int_score_column: The name of the integer score column.
- model_inference_batch_size: The size of the batch for model inference. Defaults to 256.
- has_seq_order: Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.
- autocast: Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

Initialization
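Sorting inputs by length before batching reduces padding waste inside each batch, which is why length-sorting options default to True. A minimal, framework-free sketch of the idea (the helper names are hypothetical, and the real stage operates on tokenized DataFrames rather than Python lists):

```python
# Hypothetical sketch: sort texts by length, batch them, then restore the
# original order after inference. This mirrors the idea behind
# sort_by_length/has_seq_order; it is not the stage's actual implementation.
def batched_by_length(texts: list[str], batch_size: int) -> list[list[tuple[int, str]]]:
    # Remember each text's original position (the "sequence order").
    indexed = sorted(enumerate(texts), key=lambda pair: len(pair[1]))
    return [indexed[i:i + batch_size] for i in range(0, len(indexed), batch_size)]

def restore_order(results: list[tuple[int, str]]) -> list[str]:
    # Use the saved indices to undo the length sort.
    return [text for _, text in sorted(results)]

texts = ["a long example sentence", "hi", "medium text"]
batches = batched_by_length(texts, batch_size=2)
flat = [pair for batch in batches for pair in batch]
print(restore_order(flat) == texts)  # True
```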

static configure_forward(model: torch.nn.Module) → torch.nn.Module#

create_output_dataframe(
df_cpu: pandas.DataFrame,
collected_output: dict[str, numpy.ndarray],
) → pandas.DataFrame#

outputs() → tuple[list[str], list[str]]#

Define stage output specification.

Returns (tuple[list[str], list[str]]): Tuple of (output_attributes, output_columns), where:

- output_attributes: List of task attributes this stage adds or modifies.
- output_columns: List of columns within the data that this stage adds or modifies.
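For a stage configured with the default column names, the returned specification plausibly looks like the sketch below. The attribute name "data" and the exact column list are assumptions for illustration; the real values depend on the stage's configuration:

```python
# Hypothetical sketch of the outputs() contract for a FineWebModelStage
# configured with the default pred/float/int column names.
def outputs() -> tuple[list[str], list[str]]:
    output_attributes = ["data"]  # assumed task attribute name
    output_columns = ["preds", "float_score", "int_score"]
    return (output_attributes, output_columns)

print(outputs())  # (['data'], ['preds', 'float_score', 'int_score'])
```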

process_model_output(
outputs: torch.Tensor,
_: dict[str, torch.Tensor] | None = None,
) → dict[str, numpy.ndarray]#

class stages.text.classifiers.fineweb_edu.FineWebNemotronEduClassifier(
cache_dir: str | None = None,
pred_column: str = 'fineweb-nemotron-edu-score-label',
float_score_column: str = 'fineweb-nemotron-edu-score-float',
int_score_column: str = 'fineweb-nemotron-edu-score-int',
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 1024,
autocast: bool = True,
)#

Bases: stages.text.classifiers.fineweb_edu._FineWebBaseClassifier

FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

Attributes:

- cache_dir: The Hugging Face cache directory. Defaults to None.
- pred_column: The name of the prediction column. Defaults to "fineweb-nemotron-edu-score-label".
- float_score_column: The name of the float score column. Defaults to "fineweb-nemotron-edu-score-float".
- int_score_column: The name of the integer score column. Defaults to "fineweb-nemotron-edu-score-int".
- text_field: The name of the text field in the input data. Defaults to "text".
- filter_by: For categorical classifiers, the list of labels to filter the data by. Defaults to None.
- max_chars: Limits the total number of characters that can be fed to the tokenizer. If None, text will not be truncated. Defaults to None.
- sort_by_length: Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model. Defaults to True.
- model_inference_batch_size: The size of the batch for model inference. Defaults to 1024.
- autocast: Whether to use autocast. When True, we trade off minor accuracy for faster inference. Defaults to True.

Initialization

stages.text.classifiers.fineweb_edu.MAX_SEQ_LENGTH#

512