
Distributed Data Classification


NVIDIA NeMo Curator provides a module for performing distributed classification on large text datasets using GPU acceleration. This enables the categorization and filtering of text documents along multiple dimensions, including domain, quality, safety, educational value, and content type. These classifications can enhance the quality of training data for large language models by identifying high-value content and removing problematic material.

How It Works

The distributed data classification in NeMo Curator works by:

  1. Parallel Processing: Chunking datasets across multiple computing nodes and GPUs to accelerate classification
  2. Pre-trained Models: Using specialized models for different classification tasks
  3. Batched Inference: Optimizing throughput with intelligent batching
  4. Consistent API: Providing a unified interface through the DistributedDataClassifier base class
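
Conceptually, these steps combine into a scatter/batch/gather pattern. The sketch below illustrates that pattern in plain Python with a toy stand-in for model inference; it is not NeMo Curator's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def chunk(docs, n_chunks):
    """Split a document list into roughly equal chunks, one per worker."""
    k, m = divmod(len(docs), n_chunks)
    return [docs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n_chunks)]

def classify_batch(batch):
    """Toy stand-in for batched model inference: label each doc by length."""
    return ["long" if len(doc) > 10 else "short" for doc in batch]

def distributed_classify(docs, n_workers=4):
    """Scatter chunks to workers, run batched inference, gather the labels."""
    chunks = chunk(docs, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(classify_batch, chunks)
    return [label for part in results for label in part]

labels = distributed_classify(["hi", "a much longer document", "ok"], n_workers=2)
```

In the real module, the workers are GPUs, the stand-in function is a pre-trained model, and the unified interface is the DistributedDataClassifier base class.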

The DistributedDataClassifier is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you’re using. All classifiers support filtering based on classification results and storing prediction scores as metadata.

Distributed classification requires GPU acceleration and is not supported for CPU-only processing. As long as GPU resources are available and NeMo Curator is correctly installed, GPU acceleration is handled automatically.

Running the tutorial notebooks: The classification tutorial notebooks require the text_cuda12 or all installation extra to include all relevant dependencies. If you encounter a ModuleNotFoundError, reinstall with the appropriate extra:

```bash
uv pip install "nemo-curator[text_cuda12]"
```

When using classifiers that download from Hugging Face (such as Aegis and InstructionDataGuard), set your HF_TOKEN environment variable to avoid rate limiting:

```bash
export HF_TOKEN="your_token_here"
```


Usage

NVIDIA NeMo Curator provides a base class DistributedDataClassifier that can be extended to fit your specific model. The only requirement is that the model can fit on a single GPU. This module operates on the GPU and works within the pipeline framework using DocumentBatch processing.

Classifier Comparison

| Classifier | Purpose | Model Location | Key Parameters | Requirements |
| --- | --- | --- | --- | --- |
| DomainClassifier | Assigns one of 26 domain labels (such as "Sports," "Science," "News") to English text | nvidia/domain-classifier | filter_by, text_field | None |
| MultilingualDomainClassifier | Assigns domain labels to text in 52 languages; same labels as DomainClassifier | nvidia/multilingual-domain-classifier | filter_by, text_field | None |
| QualityClassifier | Rates document quality as "Low," "Medium," or "High" using a DeBERTa model | nvidia/quality-classifier-deberta | filter_by, text_field | None |
| AegisClassifier | Detects unsafe content across 13 risk categories (violence, hate speech, and others) using LlamaGuard | nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 | aegis_variant, filter_by | HuggingFace token |
| InstructionDataGuardClassifier | Identifies LLM poisoning attacks in instruction-response pairs | nvidia/instruction-data-guard | text_field, label_field | HuggingFace token |
| FineWebEduClassifier | Scores educational value from 0 to 5 (0 = spam, 5 = scholarly) for training data selection | HuggingFaceFW/fineweb-edu-classifier | label_field, int_field | None |
| FineWebMixtralEduClassifier | Scores educational value from 0 to 5 using Mixtral 8x22B annotation data | nvidia/nemocurator-fineweb-mixtral-edu-classifier | label_field, int_field, model_inference_batch_size=1024 | None |
| FineWebNemotronEduClassifier | Scores educational value from 0 to 5 using Nemotron-4-340B annotation data | nvidia/nemocurator-fineweb-nemotron-4-edu-classifier | label_field, int_field, model_inference_batch_size=1024 | None |
| ContentTypeClassifier | Categorizes text into 11 speech types (such as "Blogs," "News," "Academic") | nvidia/content-type-classifier-deberta | filter_by, text_field | None |
| PromptTaskComplexityClassifier | Labels prompts by task type (such as QA and summarization) and complexity dimensions | nvidia/prompt-task-and-complexity-classifier | text_field | None |

Domain Classifier

The Domain Classifier categorizes English text documents into specific domains or subject areas.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import DomainClassifier

# Create pipeline
pipeline = Pipeline(name="domain_classification")

# Load dataset
reader = JsonlReader(
    file_paths="books_dataset/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the classifier, filtering for specific domains
domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
pipeline.add_stage(domain_classifier)

# Save the results
writer = JsonlWriter(path="games_and_sports/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

Multilingual Domain Classifier

Functionally similar to the Domain Classifier, but supports 52 languages.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import MultilingualDomainClassifier

pipeline = Pipeline(name="multilingual_domain_classification")
pipeline.add_stage(JsonlReader(file_paths="multilingual_dataset/", fields=["text", "id"]))
pipeline.add_stage(MultilingualDomainClassifier(filter_by=["Games", "Sports"]))
pipeline.add_stage(JsonlWriter(path="classified_output/"))

results = pipeline.run()  # Uses XennaExecutor by default
```

Quality Classifier

The Quality Classifier assesses document quality using the NVIDIA Quality Classifier DeBERTa model.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

pipeline = Pipeline(name="quality_classification")
pipeline.add_stage(JsonlReader(file_paths="web_documents/", fields=["text", "id"]))
pipeline.add_stage(QualityClassifier())
pipeline.add_stage(JsonlWriter(path="quality_classified/"))

results = pipeline.run()  # Uses XennaExecutor by default
```

The exact label categories returned by the Quality Classifier depend on the model configuration. Check the prediction column in your results to see the available labels for filtering with the filter_by parameter.
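
For any classifier that supports filter_by, the semantics are the same: documents whose predicted label is not in the supplied list are dropped. A minimal sketch of that behavior (illustrative only; the quality_pred field name here is hypothetical, not the classifier's actual output column):

```python
def filter_by_label(rows, label_field, keep_labels):
    """Keep only rows whose predicted label is in keep_labels."""
    keep = set(keep_labels)
    return [row for row in rows if row.get(label_field) in keep]

rows = [
    {"id": 1, "quality_pred": "High"},
    {"id": 2, "quality_pred": "Low"},
    {"id": 3, "quality_pred": "Medium"},
]
kept = filter_by_label(rows, "quality_pred", ["High", "Medium"])  # drops id 2
```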

AEGIS Safety Classifier

The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a HuggingFace token for access to Llama Guard.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import AegisClassifier

# Create pipeline
pipeline = Pipeline(name="aegis_classification")

# Load dataset
reader = JsonlReader(
    file_paths="content/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the AEGIS classifier
token = "hf_1234"  # Your HuggingFace user access token
safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    hf_token=token,
    filter_by=["safe", "O13"]  # Keep only safe content and the "needs caution" category
)
pipeline.add_stage(safety_classifier)

# Save the results
writer = JsonlWriter(path="safe_content/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

The classifier adds a column with labels: “safe,” “O1” through “O13” (each representing specific safety risks), or “unknown.” For raw LLM output, use:

```python
safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    hf_token=token,
    keep_raw_output=True,
    raw_output_field="raw_predictions"
)
```

Instruction Data Guard

Detects LLM poisoning attacks in instruction-response datasets. Requires HuggingFace token access.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import InstructionDataGuardClassifier

# Create pipeline
pipeline = Pipeline(name="instruction_data_guard")

# Load dataset
# For instruction-response data: "Instruction: {instruction}. Input: {input_}. Response: {response}."
reader = JsonlReader(
    file_paths="instruction_data/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the classifier
token = "hf_1234"  # Your HuggingFace user access token
classifier = InstructionDataGuardClassifier(hf_token=token)
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="guard_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

The output includes two columns: a float score, instruction_data_guard_poisoning_score, and a Boolean flag, is_poisoned.
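
Downstream, these two columns can be used to drop poisoned examples. A minimal sketch (illustrative only; the 0.5 score threshold is a hypothetical choice, not a documented default):

```python
def drop_poisoned(rows, score_field="instruction_data_guard_poisoning_score",
                  flag_field="is_poisoned", threshold=0.5):
    """Drop rows flagged as poisoned, or whose poisoning score reaches a
    (hypothetical) threshold."""
    return [
        row for row in rows
        if not row.get(flag_field) and row.get(score_field, 0.0) < threshold
    ]

rows = [
    {"id": 1, "instruction_data_guard_poisoning_score": 0.02, "is_poisoned": False},
    {"id": 2, "instruction_data_guard_poisoning_score": 0.97, "is_poisoned": True},
]
clean = drop_poisoned(rows)  # keeps only id 1
```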

FineWeb Educational Content Classifier

Scores documents on educational value from 0–5. This helps prioritize content for knowledge-intensive tasks.

Score Ranges and Meanings

| Score | Label | Description | Example Content |
| --- | --- | --- | --- |
| 0-1 | Very Low | No educational value | Spam, advertisements, broken content |
| 2 | Low | Minimal educational content | Simple lists, basic product descriptions |
| 3 | Moderate | Some educational value | News articles, basic how-to guides |
| 4 | High | Good educational content | Detailed tutorials, academic discussions |
| 5 | Very High | Excellent educational material | Comprehensive guides, scholarly articles |
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import FineWebEduClassifier

# Create pipeline
pipeline = Pipeline(name="fineweb_edu_classification")

# Load dataset
reader = JsonlReader(
    file_paths="web_documents/*.jsonl",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the FineWeb Edu classifier
edu_classifier = FineWebEduClassifier(
    model_inference_batch_size=256,
    float_score_field="fineweb-edu-score-float",  # Raw float scores
    int_score_field="fineweb-edu-score-int",  # Rounded integer scores
    label_field="fineweb-edu-score-label"  # Quality labels
)
pipeline.add_stage(edu_classifier)

# Save the results
writer = JsonlWriter(path="edu_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

FineWeb Mixtral and Nemotron Edu Classifiers

Similar to the FineWeb Edu Classifier but trained with different annotation sources:

  • FineWebMixtralEduClassifier: Uses annotations from Mixtral 8x22B-Instruct
  • FineWebNemotronEduClassifier: Uses annotations from Nemotron-4-340B-Instruct

Both provide a quality label column marking scores above 2.5 as “high_quality”:

Quality Label Mapping

| Score Range | Quality Label | Description |
| --- | --- | --- |
| 0.0 - 2.5 | low_quality | Below average educational value |
| 2.5 - 5.0 | high_quality | Above average educational value |
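
The mapping in this table reduces to a single threshold comparison, sketched here (assuming a strict "above 2.5" cutoff, as stated above):

```python
def edu_quality_label(score: float, threshold: float = 2.5) -> str:
    """Map a 0-5 educational score to the binary quality label."""
    return "high_quality" if score > threshold else "low_quality"
```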
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import FineWebMixtralEduClassifier  # or FineWebNemotronEduClassifier

# Create pipeline
pipeline = Pipeline(name="fineweb_mixtral_edu_classification")

# Load dataset
reader = JsonlReader(
    file_paths="web_documents/*.jsonl",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the FineWeb Mixtral Edu classifier
classifier = FineWebMixtralEduClassifier(
    float_score_field="fineweb-mixtral-edu-score-float",  # Raw float scores
    int_score_field="fineweb-mixtral-edu-score-int",  # Rounded integer scores
    label_field="fineweb-mixtral-edu-score-label"  # "high_quality" or "low_quality"
)
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="mixtral_edu_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

Content Type Classifier

Categorizes documents into 11 distinct speech types.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import ContentTypeClassifier

# Create pipeline
pipeline = Pipeline(name="content_type_classification")

# Load dataset
reader = JsonlReader(
    file_paths="content/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the Content Type classifier
classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="content_type_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

Prompt Task and Complexity Classifier

Classifies prompts by task type and complexity dimensions.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import PromptTaskComplexityClassifier

# Create pipeline
pipeline = Pipeline(name="prompt_task_complexity_classification")

# Load dataset
reader = JsonlReader(
    file_paths="prompts/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the Prompt Task Complexity classifier
classifier = PromptTaskComplexityClassifier()
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="prompt_complexity_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
```

Custom Model Integration

You can integrate your own classification models by extending DistributedDataClassifier. Refer to the Text Classifiers README for implementation details and examples.
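
As a rough illustration of the extension pattern, the self-contained sketch below shows a base class that adds a prediction column and applies filter_by, plus a toy subclass. The class and method names here are invented for this sketch and are not NeMo Curator's actual interface; consult the README for the real one:

```python
from dataclasses import dataclass, field

@dataclass
class SketchClassifierBase:
    """Illustrative stand-in for a distributed-classifier base class."""
    text_field: str = "text"
    pred_field: str = "pred"
    filter_by: list = field(default_factory=list)

    def predict(self, texts):
        """Subclasses implement batched model inference here."""
        raise NotImplementedError

    def __call__(self, rows):
        # Annotate every row with a prediction, then optionally filter.
        preds = self.predict([row[self.text_field] for row in rows])
        for row, pred in zip(rows, preds):
            row[self.pred_field] = pred
        if self.filter_by:
            rows = [row for row in rows if row[self.pred_field] in set(self.filter_by)]
        return rows

class KeywordClassifier(SketchClassifierBase):
    """Toy subclass: label text by the presence of a keyword."""
    def predict(self, texts):
        return ["sports" if "goal" in t.lower() else "other" for t in texts]

kept_rows = KeywordClassifier(filter_by=["sports"])(
    [{"text": "What a goal!"}, {"text": "Quarterly earnings rose."}]
)
```

The real base class additionally handles GPU placement, batching, and tokenization, which this sketch omits.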

Performance Optimization

NVIDIA NeMo Curator’s distributed classifiers are optimized for high-throughput processing through several key features:

Intelligent Batching and Sequence Handling

The classifiers optimize throughput through:

  • Length-based sorting: Input sequences are sorted by length when sort_by_length=True (default)
  • Efficient batching: Similar-length sequences are grouped together to minimize padding overhead
  • GPU memory optimization: Batches are sized to maximize GPU utilization based on available memory
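
The padding savings from length-based sorting can be shown with a small self-contained sketch (conceptual only, using whitespace token counts rather than a real tokenizer, and not Curator's actual batching code):

```python
def batch_by_length(texts, batch_size):
    """Sort texts by (whitespace) token count, then batch neighbors together
    so each batch contains similar-length sequences."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i].split()))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_waste(texts, batches):
    """Padded slots minus real tokens: every sequence in a batch is padded
    to the longest sequence in that batch."""
    waste = 0
    for batch in batches:
        lengths = [len(texts[i].split()) for i in batch]
        waste += max(lengths) * len(lengths) - sum(lengths)
    return waste

texts = ["a", "a b c d", "a b", "a b c d e"]
sorted_batches = batch_by_length(texts, batch_size=2)  # [[0, 2], [1, 3]]
unsorted_batches = [[0, 1], [2, 3]]                    # arrival order
```

On this toy input, sorted batching pads 2 filler tokens versus 6 for arrival order; the gap grows quickly with realistic length distributions.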