Distributed Classifier | NeMo Curator

NVIDIA NeMo Curator provides a module for performing distributed classification on large text datasets using GPU acceleration. This enables the categorization and filtering of text documents based on multiple dimensions such as domain, quality, safety, educational value, content type, and more. These classifications can enhance the quality of training data for large language models by identifying high-value content and removing problematic material.

How It Works

The distributed data classification in NeMo Curator works by:

Parallel Processing: Chunking datasets across multiple computing nodes and GPUs to accelerate classification
Pre-trained Models: Using specialized models for different classification tasks
Batched Inference: Optimizing throughput with intelligent batching
Consistent API: Providing a unified interface through the DistributedDataClassifier base class

The DistributedDataClassifier is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you’re using. All classifiers support filtering based on classification results and storing prediction scores as metadata.
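
As a rough illustration of the "chunking" idea in the list above, the following stdlib-only sketch deals a list of input files out to a fixed number of workers. It is a conceptual example, not NeMo Curator's actual scheduling logic; `partition_round_robin` and the file names are hypothetical.

```python
def partition_round_robin(items, num_workers):
    """Split `items` into `num_workers` chunks, dealing them out in turn.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    chunks = [[] for _ in range(num_workers)]
    for index, item in enumerate(items):
        chunks[index % num_workers].append(item)
    return chunks


# Five made-up input files shared between two workers.
files = [f"part-{i:03d}.jsonl" for i in range(5)]
chunks = partition_round_robin(files, num_workers=2)
```

Each worker would then classify only its own chunk, which is what lets throughput scale with the number of GPUs.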

Distributed classification requires GPU acceleration and is not supported for CPU-only processing. As long as GPU resources are available and NeMo Curator is correctly installed, GPU acceleration is handled automatically.

Running the tutorial notebooks: The classification tutorial notebooks require the text_cuda12 or all installation extra to include all relevant dependencies. If you encounter ModuleNotFoundError, reinstall with the appropriate extra:

uv pip install "nemo-curator[text_cuda12]"

When using classifiers that download from Hugging Face (such as Aegis and InstructionDataGuard), set your HF_TOKEN environment variable to avoid rate limiting:

export HF_TOKEN="your_token_here"
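
Inside a Python script, the token can be read from the environment rather than hard-coded. The sketch below is a minimal, stdlib-only helper; `get_hf_token` is a hypothetical name and not part of NeMo Curator.

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token from HF_TOKEN, or raise if it is unset.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before running the pipeline")
    return token
```

The returned value can be passed wherever a classifier accepts an `hf_token` argument, which keeps the secret out of source control.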

Usage

NVIDIA NeMo Curator provides a base class DistributedDataClassifier that can be extended to fit your specific model. The only requirement is that the model can fit on a single GPU. This module operates on the GPU and works within the pipeline framework using DocumentBatch processing.

Classifier Comparison

Classifier	Purpose	Model Location	Key Parameters	Requirements
DomainClassifier	Assigns one of 26 domain labels (such as “Sports,” “Science,” “News”) to English text	nvidia/domain-classifier	`filter_by`, `text_field`	None
MultilingualDomainClassifier	Assigns domain labels to text in 52 languages; same labels as DomainClassifier	nvidia/multilingual-domain-classifier	`filter_by`, `text_field`	None
QualityClassifier	Rates document quality as “Low,” “Medium,” or “High” using a DeBERTa model	nvidia/quality-classifier-deberta	`filter_by`, `text_field`	None
AegisClassifier	Detects unsafe content across 13 risk categories (violence, hate speech, and others) using LlamaGuard	nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0	`aegis_variant`, `filter_by`	HuggingFace token
InstructionDataGuardClassifier	Identifies LLM poisoning attacks in instruction-response pairs	nvidia/instruction-data-guard	`text_field`, `label_field`	HuggingFace token
FineWebEduClassifier	Scores educational value from 0 to 5 (0=spam, 5=scholarly) for training data selection	HuggingFaceFW/fineweb-edu-classifier	`label_field`, `int_score_field`	None
FineWebMixtralEduClassifier	Scores educational value from 0 to 5 using Mixtral 8x22B annotation data	nvidia/nemocurator-fineweb-mixtral-edu-classifier	`label_field`, `int_score_field`, `model_inference_batch_size=1024`	None
FineWebNemotronEduClassifier	Scores educational value from 0 to 5 using Nemotron-4-340B annotation data	nvidia/nemocurator-fineweb-nemotron-4-edu-classifier	`label_field`, `int_score_field`, `model_inference_batch_size=1024`	None
ContentTypeClassifier	Categorizes text into 11 speech types (such as “Blogs,” “News,” “Academic”)	nvidia/content-type-classifier-deberta	`filter_by`, `text_field`	None
PromptTaskComplexityClassifier	Labels prompts by task type (such as QA and summarization) and complexity dimensions	nvidia/prompt-task-and-complexity-classifier	`text_field`	None

Domain Classifier

The Domain Classifier categorizes English text documents into specific domains or subject areas.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import DomainClassifier

# Create pipeline
pipeline = Pipeline(name="domain_classification")

# Load dataset
reader = JsonlReader(
    file_paths="books_dataset/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the classifier, filtering for specific domains
domain_classifier = DomainClassifier(filter_by=["Games", "Sports"])
pipeline.add_stage(domain_classifier)

# Save the results
writer = JsonlWriter(path="games_and_sports/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default

Multilingual Domain Classifier

The Multilingual Domain Classifier works like the Domain Classifier and assigns the same labels, but supports text in 52 languages.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import MultilingualDomainClassifier

pipeline = Pipeline(name="multilingual_domain_classification")
pipeline.add_stage(JsonlReader(file_paths="multilingual_dataset/", fields=["text", "id"]))
pipeline.add_stage(MultilingualDomainClassifier(filter_by=["Games", "Sports"]))
pipeline.add_stage(JsonlWriter(path="classified_output/"))

results = pipeline.run()  # Uses XennaExecutor by default

Quality Classifier

The Quality Classifier assesses document quality using the NVIDIA Quality Classifier DeBERTa model.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import QualityClassifier

pipeline = Pipeline(name="quality_classification")
pipeline.add_stage(JsonlReader(file_paths="web_documents/", fields=["text", "id"]))
pipeline.add_stage(QualityClassifier())
pipeline.add_stage(JsonlWriter(path="quality_classified/"))

results = pipeline.run()  # Uses XennaExecutor by default

The exact label categories returned by the Quality Classifier depend on the model configuration. Check the prediction column in your results to see the available labels for filtering with the filter_by parameter.
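
One way to inspect the available labels is to count the distinct values of the prediction column in the written JSONL output. The sketch below uses only the standard library; the column name `quality_pred` and the sample records are assumptions made for illustration, so substitute the column name that appears in your own results.

```python
import json
from collections import Counter


def count_labels(jsonl_lines, column):
    """Count how often each value of `column` appears in JSONL records."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[record.get(column, "<missing>")] += 1
    return counts


# Made-up sample records; "quality_pred" is an assumed column name.
sample = [
    '{"id": 1, "text": "a", "quality_pred": "High"}',
    '{"id": 2, "text": "b", "quality_pred": "Low"}',
    '{"id": 3, "text": "c", "quality_pred": "High"}',
]
label_counts = count_labels(sample, "quality_pred")
```

To run this on real output, pass an open file handle instead of the sample list, since iterating over a file yields one line at a time.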

AEGIS Safety Classifier

The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a HuggingFace token for access to Llama Guard.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import AegisClassifier

# Create pipeline
pipeline = Pipeline(name="aegis_classification")

# Load dataset
reader = JsonlReader(
    file_paths="content/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the AEGIS classifier
token = "hf_1234"  # Your HuggingFace user access token
safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    hf_token=token,
    filter_by=["safe", "O13"]  # Keep only safe content and "needs caution" category
)
pipeline.add_stage(safety_classifier)

# Save the results
writer = JsonlWriter(path="safe_content/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default

The classifier adds a column with labels: “safe,” “O1” through “O13” (each representing specific safety risks), or “unknown.”
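
Because the label set is small and fixed, it can help to validate `filter_by` values before launching a long GPU job. The following sketch builds the label list described above and reports any unrecognized entries; `AEGIS_LABELS` and `check_filter_values` are hypothetical names, not part of NeMo Curator.

```python
# Labels as described in the text: "safe", "O1" through "O13", and "unknown".
AEGIS_LABELS = ["safe"] + [f"O{i}" for i in range(1, 14)] + ["unknown"]


def check_filter_values(filter_by):
    """Return the entries of `filter_by` that are not known AEGIS labels.

    Hypothetical helper for illustration only.
    """
    known = set(AEGIS_LABELS)
    return [value for value in filter_by if value not in known]
```

An empty result means every requested label is recognized. The check is case-sensitive, so `"Safe"` would be reported as unknown.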

The AEGIS classifier relies on the LlamaGuard-7b base model, which is a generative LLM. This makes it significantly slower than the other classifiers in NeMo Curator that use encoder-based models (such as DeBERTa). Full GPU utilization is confirmed when running AEGIS on multi-GPU setups, but expect longer processing times compared to non-generative classifiers due to the autoregressive nature of the underlying model.

For raw LLM output, use:

safety_classifier = AegisClassifier(
    aegis_variant="nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
    hf_token=token,
    keep_raw_output=True,
    raw_output_field="raw_predictions"
)

Instruction Data Guard

Detects LLM poisoning attacks in instruction-response datasets. Requires HuggingFace token access.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import InstructionDataGuardClassifier

# Create pipeline
pipeline = Pipeline(name="instruction_data_guard")

# Load dataset
# For instruction-response data: "Instruction: {instruction}. Input: {input_}. Response: {response}."
reader = JsonlReader(
    file_paths="instruction_data/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the classifier
token = "hf_1234"  # Your HuggingFace user access token
classifier = InstructionDataGuardClassifier(hf_token=token)
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="guard_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default

The output includes two columns: a float score instruction_data_guard_poisoning_score and a Boolean is_poisoned.
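
A typical follow-up step is to separate flagged records from clean ones. The sketch below does this over JSONL lines using the two output columns named above; `split_by_poison_flag` is a hypothetical helper and the sample scores are made up.

```python
import json


def split_by_poison_flag(jsonl_lines):
    """Split JSONL records into (clean, poisoned) lists using `is_poisoned`.

    Hypothetical helper for illustration only.
    """
    clean, poisoned = [], []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        (poisoned if record["is_poisoned"] else clean).append(record)
    return clean, poisoned


# Made-up sample records using the column names from the text.
sample = [
    '{"id": 1, "instruction_data_guard_poisoning_score": 0.02, "is_poisoned": false}',
    '{"id": 2, "instruction_data_guard_poisoning_score": 0.97, "is_poisoned": true}',
    '{"id": 3, "instruction_data_guard_poisoning_score": 0.10, "is_poisoned": false}',
]
clean, poisoned = split_by_poison_flag(sample)
```

Keeping the float score alongside the Boolean makes it possible to review borderline records manually instead of relying on the flag alone.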

FineWeb Educational Content Classifier

Scores documents on educational value from 0–5. This helps prioritize content for knowledge-intensive tasks.

Score Ranges and Meanings

Score	Label	Description	Example Content
0-1	Very Low	No educational value	Spam, advertisements, broken content
2	Low	Minimal educational content	Simple lists, basic product descriptions
3	Moderate	Some educational value	News articles, basic how-to guides
4	High	Good educational content	Detailed tutorials, academic discussions
5	Very High	Excellent educational material	Comprehensive guides, scholarly articles

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import FineWebEduClassifier

# Create pipeline
pipeline = Pipeline(name="fineweb_edu_classification")

# Load dataset
reader = JsonlReader(
    file_paths="web_documents/*.jsonl",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the FineWeb Edu classifier
edu_classifier = FineWebEduClassifier(
    model_inference_batch_size=256,
    float_score_field="fineweb-edu-score-float",  # Raw float scores
    int_score_field="fineweb-edu-score-int",      # Rounded integer scores
    label_field="fineweb-edu-score-label"         # Quality labels
)
pipeline.add_stage(edu_classifier)

# Save the results
writer = JsonlWriter(path="edu_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
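
The Score Ranges and Meanings table in this section can be expressed as a small lookup. The sketch below clamps a raw float score to the 0 to 5 range, rounds it, and returns the table's descriptive label. The clamp-then-round step is an assumption about how the integer score is derived, and the text does not state whether the classifier's own label column uses these strings; `edu_label` is a hypothetical helper.

```python
# Descriptive labels taken from the score table; 0 and 1 share "Very Low".
EDU_LABELS = {
    0: "Very Low",
    1: "Very Low",
    2: "Low",
    3: "Moderate",
    4: "High",
    5: "Very High",
}


def edu_label(float_score):
    """Map a raw float score to (integer score, descriptive label).

    Assumption: the integer score is the float clamped to [0, 5] and rounded.
    Python's round() resolves exact .5 ties to the nearest even integer.
    """
    clamped = max(0.0, min(5.0, float_score))
    int_score = int(round(clamped))
    return int_score, EDU_LABELS[int_score]
```

For example, a raw score of 3.7 maps to integer score 4 and the label "High", while out-of-range values are pulled back to the nearest end of the scale.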

FineWeb Mixtral and Nemotron Edu Classifiers

Similar to the FineWeb Edu Classifier but trained with different annotation sources:

FineWebMixtralEduClassifier: Uses annotations from Mixtral 8x22B-Instruct
FineWebNemotronEduClassifier: Uses annotations from Nemotron-4-340B-Instruct

Both provide a quality label column marking scores above 2.5 as “high_quality”:

Quality Label Mapping

Score Range	Quality Label	Description
0.0 - 2.5	`low_quality`	Below average educational value
2.5 - 5.0	`high_quality`	Above average educational value

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import FineWebMixtralEduClassifier  # or FineWebNemotronEduClassifier

# Create pipeline
pipeline = Pipeline(name="fineweb_mixtral_edu_classification")

# Load dataset
reader = JsonlReader(
    file_paths="web_documents/*.jsonl",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the FineWeb Mixtral Edu classifier
classifier = FineWebMixtralEduClassifier(
    float_score_field="fineweb-mixtral-edu-score-float",  # Raw float scores
    int_score_field="fineweb-mixtral-edu-score-int",      # Rounded integer scores
    label_field="fineweb-mixtral-edu-score-label"          # "high_quality" or "low_quality"
)
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="mixtral_edu_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default
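
The quality label mapping amounts to a single threshold comparison. The sketch below follows the prose ("scores above 2.5"), so a score of exactly 2.5 is treated as `low_quality`. That boundary choice is an assumption, because the mapping table lists 2.5 in both ranges; `quality_label` is a hypothetical helper.

```python
def quality_label(score, threshold=2.5):
    """Return "high_quality" for scores strictly above `threshold`.

    Assumption: a score exactly equal to the threshold counts as low quality.
    """
    return "high_quality" if score > threshold else "low_quality"
```

If your results show that a score of exactly 2.5 is labeled `high_quality`, change the comparison to `>=` to match.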

Content Type Classifier

Categorizes documents into 11 distinct speech types.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import ContentTypeClassifier

# Create pipeline
pipeline = Pipeline(name="content_type_classification")

# Load dataset
reader = JsonlReader(
    file_paths="content/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the Content Type classifier
classifier = ContentTypeClassifier(filter_by=["Blogs", "News"])
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="content_type_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default

Prompt Task and Complexity Classifier

Classifies prompts by task type and complexity dimensions.

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import PromptTaskComplexityClassifier

# Create pipeline
pipeline = Pipeline(name="prompt_task_complexity_classification")

# Load dataset
reader = JsonlReader(
    file_paths="prompts/",
    fields=["text", "id"]
)
pipeline.add_stage(reader)

# Apply the Prompt Task Complexity classifier
classifier = PromptTaskComplexityClassifier()
pipeline.add_stage(classifier)

# Save the results
writer = JsonlWriter(path="prompt_complexity_classified/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()  # Uses XennaExecutor by default

Share Tokens Across Multiple Classifiers

Many of NeMo Curator’s text classifiers use the same underlying tokenizer. When you run multiple classifiers in a single pipeline, you can tokenize once and reuse the tokens for all compatible classifiers. This avoids redundant tokenization and speeds up your pipeline.

Compatible Tokenizer Groups

Classifiers that share the same base tokenizer can reuse each other’s tokens:

Tokenizer	Classifiers
DeBERTa-v3-base	DomainClassifier, MultilingualDomainClassifier, QualityClassifier, ContentTypeClassifier, FineWebEduClassifier, FineWebMixtralEduClassifier, FineWebNemotronEduClassifier, PromptTaskComplexityClassifier
LlamaGuard-7b	AegisClassifier, InstructionDataGuardClassifier

Classifiers from different tokenizer groups are not compatible. You cannot reuse tokens generated for a DeBERTa-based classifier with an AEGIS classifier, or the reverse.

When using use_existing_tokens=True with AegisClassifier, you must also specify the aegis_prompt_field parameter. This field tells the classifier which column contains the pre-formatted prompt text. Omitting it raises a ValueError at pipeline construction time.
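
The compatibility rule can be checked before building a pipeline. The sketch below encodes the tokenizer groups from the table above and reports whether a chain of classifiers stays within one group; `TOKENIZER_GROUPS`, `tokenizer_group`, and `can_share_tokens` are hypothetical names, not part of NeMo Curator.

```python
# Tokenizer groups copied from the compatibility table.
TOKENIZER_GROUPS = {
    "DeBERTa-v3-base": {
        "DomainClassifier",
        "MultilingualDomainClassifier",
        "QualityClassifier",
        "ContentTypeClassifier",
        "FineWebEduClassifier",
        "FineWebMixtralEduClassifier",
        "FineWebNemotronEduClassifier",
        "PromptTaskComplexityClassifier",
    },
    "LlamaGuard-7b": {"AegisClassifier", "InstructionDataGuardClassifier"},
}


def tokenizer_group(classifier_name):
    """Return the tokenizer group that `classifier_name` belongs to."""
    for group, members in TOKENIZER_GROUPS.items():
        if classifier_name in members:
            return group
    raise KeyError(f"unknown classifier: {classifier_name}")


def can_share_tokens(chain):
    """True if every classifier in `chain` uses the same tokenizer group."""
    return len({tokenizer_group(name) for name in chain}) <= 1
```

A chain that mixes groups, such as `DomainClassifier` followed by `AegisClassifier`, returns `False` and needs separate tokenization for each group.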

How It Works

Each classifier exposes two parameters for token sharing:

keep_tokens (default: False): When True, the classifier preserves the input_ids and attention_mask columns in its output instead of dropping them. Set this on the first classifier in the pipeline so downstream classifiers can reuse the tokens.
use_existing_tokens (default: False): When True, the classifier skips its internal tokenization step and uses the input_ids and attention_mask columns already present in the data. Set this on all subsequent classifiers.

Example: Tokenize Once for Multiple Classifiers

The following pipeline tokenizes the input text once with the DomainClassifier, then passes those tokens to the QualityClassifier and ContentTypeClassifier:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.classifiers import (
    DomainClassifier,
    QualityClassifier,
    ContentTypeClassifier,
)

pipeline = Pipeline(name="shared_tokenizer_pipeline")

# Load dataset
pipeline.add_stage(JsonlReader(file_paths="documents/", fields=["text", "id"]))

# First classifier: tokenize and keep the tokens
pipeline.add_stage(DomainClassifier(keep_tokens=True))

# Second classifier: reuse existing tokens, keep them for the next stage
pipeline.add_stage(QualityClassifier(use_existing_tokens=True, keep_tokens=True))

# Third classifier: reuse existing tokens, drop them (last in chain)
pipeline.add_stage(ContentTypeClassifier(use_existing_tokens=True, keep_tokens=False))

# Save the results
pipeline.add_stage(JsonlWriter(path="classified_output/"))

results = pipeline.run()

Set keep_tokens=True on every classifier in the chain except the last one. The last classifier can use keep_tokens=False (the default) to drop the token columns from the final output.
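
The rule generalizes to chains of any length: the first classifier tokenizes, every classifier except the last keeps the tokens, and every classifier except the first reuses them. The sketch below derives the two flags from a classifier's position in the chain; `token_flags` is a hypothetical helper that only computes keyword arguments and does not call NeMo Curator.

```python
def token_flags(chain):
    """Return (name, kwargs) pairs with token-sharing flags for an ordered chain.

    Hypothetical helper for illustration only.
    """
    last = len(chain) - 1
    return [
        (
            name,
            {
                "use_existing_tokens": position > 0,  # all but the first reuse tokens
                "keep_tokens": position < last,       # all but the last keep tokens
            },
        )
        for position, name in enumerate(chain)
    ]


flags = token_flags(["DomainClassifier", "QualityClassifier", "ContentTypeClassifier"])
```

For a single-classifier chain both flags come out `False`, which matches the documented defaults.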

Custom Model Integration

You can integrate your own classification models by extending DistributedDataClassifier. Refer to the Text Classifiers README for implementation details and examples.

Performance Optimization

NVIDIA NeMo Curator’s distributed classifiers are optimized for high-throughput processing through several key features:

Intelligent Batching and Sequence Handling

The classifiers optimize throughput through:

Length-based sorting: Input sequences are sorted by length when sort_by_length=True (default)
Efficient batching: Similar-length sequences are grouped together to minimize padding overhead
GPU memory optimization: Batches are sized to maximize GPU utilization based on available memory
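
To see why length-based sorting helps, consider how much padding is needed when every batch is padded to its longest sequence. The sketch below is a conceptual, stdlib-only illustration with made-up sequence lengths, not NeMo Curator's batching code; `padding_tokens` is a hypothetical helper.

```python
def padding_tokens(lengths, batch_size):
    """Total padding needed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Made-up token counts for six documents.
lengths = [512, 12, 480, 20, 500, 16]
unsorted_padding = padding_tokens(lengths, batch_size=3)
sorted_padding = padding_tokens(sorted(lengths), batch_size=3)
```

With these lengths, batching in arrival order wastes 1496 padding tokens, while batching after sorting wastes 56, because short and long sequences no longer share a batch.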
