nemo_curator.stages.text.classifiers.fineweb_edu


Module Contents

Classes

FineWebEduClassifier: a specialized classifier for educational content assessment, built on the Hugging Face FineWeb EDU Classifier model.
FineWebMixtralEduClassifier: a specialized classifier for educational content assessment, built on the NemoCurator FineWeb Mixtral Edu Classifier model.
FineWebModelStage: stage for Hugging Face model inference.
FineWebNemotronEduClassifier: a specialized classifier for educational content assessment, built on the NemoCurator FineWeb Nemotron-4 Edu Classifier model.
_FineWebBaseClassifier: parent class for FineWebEduClassifier, FineWebMixtralEduClassifier, and FineWebNemotronEduClassifier.

Data

FINEWEB_EDU_MODEL_IDENTIFIER

FINEWEB_MIXTRAL_EDU_MODEL_IDENTIFIER

FINEWEB_NEMOTRON_EDU_MODEL_IDENTIFIER

MAX_SEQ_LENGTH

API

class nemo_curator.stages.text.classifiers.fineweb_edu.FineWebEduClassifier(
cache_dir: str | None = None,
label_field: str = 'fineweb-edu-score-label',
float_score_field: str = 'fineweb-edu-score-float',
int_score_field: str = 'fineweb-edu-score-int',
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)

Bases: _FineWebBaseClassifier

FineWebEduClassifier is a specialized classifier designed for educational content assessment, utilizing the Hugging Face FineWeb EDU Classifier model (https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier). This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

name
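For orientation, a minimal usage sketch (not run here: it assumes the classifier is composed into a pipeline with a suitable executor, and that a GPU and the Hugging Face model weights are available; see the constructor parameters documented under _FineWebBaseClassifier below):

```
from nemo_curator.stages.text.classifiers.fineweb_edu import FineWebEduClassifier

# Construct with the documented defaults; the output batch gains the
# 'fineweb-edu-score-label', 'fineweb-edu-score-float', and
# 'fineweb-edu-score-int' fields.
classifier = FineWebEduClassifier(
    text_field="text",
    model_inference_batch_size=256,
)
# The classifier is a composite stage over DocumentBatch: add it to a
# pipeline and run it with your executor of choice.
```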
class nemo_curator.stages.text.classifiers.fineweb_edu.FineWebMixtralEduClassifier(
cache_dir: str | None = None,
label_field: str = 'fineweb-mixtral-edu-score-...,
float_score_field: str = 'fineweb-mixtral-edu-score-...,
int_score_field: str = 'fineweb-mixtral-edu-score-...,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 1024,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)

Bases: _FineWebBaseClassifier

FineWebMixtralEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Mixtral Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-mixtral-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Mixtral 8x22B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

name
class nemo_curator.stages.text.classifiers.fineweb_edu.FineWebModelStage(
model_identifier: str,
cache_dir: str | None = None,
label_field: str = 'preds',
float_score_field: str = 'float_score',
int_score_field: str = 'int_score',
model_inference_batch_size: int = 256,
has_seq_order: bool = True,
autocast: bool = True,
keep_tokens: bool = False
)

Bases: ModelStage

Stage for Hugging Face model inference.

Parameters:

model_identifier
str

The identifier of the Hugging Face model.

cache_dir
str | None. Defaults to None.

The Hugging Face cache directory.

label_field
str. Defaults to 'preds'.

The name of the prediction column.

float_score_field
str. Defaults to 'float_score'.

The name of the float score column.

int_score_field
str. Defaults to 'int_score'.

The name of the integer score column.

model_inference_batch_size
int. Defaults to 256.

The size of the batch for model inference.

has_seq_order
bool. Defaults to True.

Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model.

autocast
bool. Defaults to True.

Whether to use autocast. When True, we trade off minor accuracy for faster inference.

keep_tokens
bool. Defaults to False.

Whether to keep the input tokens in the output dataframe.

nemo_curator.stages.text.classifiers.fineweb_edu.FineWebModelStage._setup(
local_files_only: bool = True
) -> None
nemo_curator.stages.text.classifiers.fineweb_edu.FineWebModelStage.configure_forward(
model: torch.nn.Module
) -> torch.nn.Module
staticmethod
nemo_curator.stages.text.classifiers.fineweb_edu.FineWebModelStage.create_output_dataframe(
df_cpu: pandas.DataFrame,
collected_output: dict[str, numpy.ndarray]
) -> pandas.DataFrame
nemo_curator.stages.text.classifiers.fineweb_edu.FineWebModelStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.fineweb_edu.FineWebModelStage.process_model_output(
outputs: torch.Tensor,
_: dict[str, torch.Tensor] | None = None
) -> dict[str, numpy.ndarray]
class nemo_curator.stages.text.classifiers.fineweb_edu.FineWebNemotronEduClassifier(
cache_dir: str | None = None,
label_field: str = 'fineweb-nemotron-edu-score...,
float_score_field: str = 'fineweb-nemotron-edu-score...,
int_score_field: str = 'fineweb-nemotron-edu-score...,
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
sort_by_length: bool = True,
model_inference_batch_size: int = 1024,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)

Bases: _FineWebBaseClassifier

FineWebNemotronEduClassifier is a specialized classifier designed for educational content assessment, utilizing the NemoCurator FineWeb Nemotron-4 Edu Classifier model (https://huggingface.co/nvidia/nemocurator-fineweb-nemotron-4-edu-classifier). It is similar to the FineWeb-Edu classifier and was trained on the same text samples, but using annotations from Nemotron-4-340B-Instruct. This classifier is optimized for running on multi-node, multi-GPU setups to enable fast and efficient inference on large text datasets.

name
class nemo_curator.stages.text.classifiers.fineweb_edu._FineWebBaseClassifier(
model_identifier: str,
cache_dir: str | None = None,
label_field: str = 'preds',
float_score_field: str = 'float_score',
int_score_field: str = 'int_score',
text_field: str = 'text',
filter_by: list[str] | None = None,
max_chars: int | None = None,
max_seq_length: int = MAX_SEQ_LENGTH,
sort_by_length: bool = True,
model_inference_batch_size: int = 256,
autocast: bool = True,
keep_tokens: bool = False,
use_existing_tokens: bool = False
)
Dataclass

Bases: CompositeStage[DocumentBatch, DocumentBatch]

Parent class for FineWebEduClassifier, FineWebMixtralEduClassifier, and FineWebNemotronEduClassifier, since their implementations are almost identical.

Parameters:

model_identifier
str

The identifier of the Hugging Face model.

cache_dir
str | None. Defaults to None.

The Hugging Face cache directory.

label_field
str. Defaults to 'preds'.

The name of the prediction column.

float_score_field
str. Defaults to 'float_score'.

The name of the float score column.

int_score_field
str. Defaults to 'int_score'.

The name of the integer score column.

text_field
str. Defaults to 'text'.

The name of the text field in the input data.

filter_by
list[str] | None. Defaults to None.

For categorical classifiers, the list of labels to filter the data by.

max_chars
int | None. Defaults to None.

Limits the total number of characters that can be fed to the tokenizer. If None, text is not truncated.

max_seq_length
int. Defaults to MAX_SEQ_LENGTH (512).

Limits the length of the token sequence returned by the tokenizer.

sort_by_length
bool. Defaults to True.

Whether to sort the input data by the length of the input tokens. Sorting is encouraged to improve the performance of the inference model.

model_inference_batch_size
int. Defaults to 256.

The size of the batch for model inference.

autocast
bool. Defaults to True.

Whether to use autocast. When True, we trade off minor accuracy for faster inference.

keep_tokens
bool. Defaults to False.

Whether to keep the input tokens in the output dataframe.

use_existing_tokens
bool. Defaults to False.

Whether to use the existing tokens from the input dataframe. If True, assume the relevant token fields are ["input_ids", "attention_mask"] and skip tokenization.
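Two of the parameters above are simple enough to illustrate directly. The helpers below are sketches of the behavior described, not the library's implementation: `max_chars` caps the characters handed to the tokenizer, and `sort_by_length` groups similar-length inputs so that each inference batch needs less padding.

```python
def truncate_chars(text, max_chars=None):
    # Illustrates max_chars: a character-level cap applied before
    # tokenization; None means no truncation.
    return text if max_chars is None else text[:max_chars]

def sort_texts_by_length(texts):
    # Illustrates sort_by_length: batching similar-length inputs reduces
    # padding; the returned indices let the caller restore original order.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [texts[i] for i in order], order
```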

autocast
bool = True
cache_dir
str | None = None
filter_by
list[str] | None = None
float_score_field
str = 'float_score'
int_score_field
str = 'int_score'
keep_tokens
bool = False
label_field
str = 'preds'
max_chars
int | None = None
max_seq_length
int = MAX_SEQ_LENGTH
model_identifier
str
model_inference_batch_size
int = 256
sort_by_length
bool = True
text_field
str = 'text'
use_existing_tokens
bool = False
nemo_curator.stages.text.classifiers.fineweb_edu._FineWebBaseClassifier.__post_init__() -> None
nemo_curator.stages.text.classifiers.fineweb_edu._FineWebBaseClassifier.decompose() -> list[nemo_curator.stages.base.ProcessingStage]
nemo_curator.stages.text.classifiers.fineweb_edu._FineWebBaseClassifier.filter_by_category(
value: str
) -> bool
nemo_curator.stages.text.classifiers.fineweb_edu._FineWebBaseClassifier.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.fineweb_edu._FineWebBaseClassifier.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.classifiers.fineweb_edu.FINEWEB_EDU_MODEL_IDENTIFIER = 'HuggingFaceFW/fineweb-edu-classifier'
nemo_curator.stages.text.classifiers.fineweb_edu.FINEWEB_MIXTRAL_EDU_MODEL_IDENTIFIER = 'nvidia/nemocurator-fineweb-mixtral-edu-classifier'
nemo_curator.stages.text.classifiers.fineweb_edu.FINEWEB_NEMOTRON_EDU_MODEL_IDENTIFIER = 'nvidia/nemocurator-fineweb-nemotron-4-edu-classifier'
nemo_curator.stages.text.classifiers.fineweb_edu.MAX_SEQ_LENGTH = 512