NVIDIA NeMo Curator provides a module for performing distributed classification on large text datasets using GPU acceleration. This enables the categorization and filtering of text documents based on multiple dimensions such as domain, quality, safety, educational value, content type, and more. These classifications can enhance the quality of training data for large language models by identifying high-value content and removing problematic material.
Distributed data classification in NeMo Curator is built on the DistributedDataClassifier base class. The DistributedDataClassifier is designed to run on GPU clusters with minimal code changes regardless of which specific classifier you’re using. All classifiers support filtering based on classification results and storing prediction scores as metadata.
Distributed classification requires GPU acceleration and is not supported for CPU-only processing. As long as GPU resources are available and NeMo Curator is correctly installed, GPU acceleration is handled automatically.
Running the tutorial notebooks: The classification tutorial notebooks require the text_cuda12 or all installation extra to include all relevant dependencies. If you encounter ModuleNotFoundError, reinstall with the appropriate extra:
uv pip install "nemo-curator[text_cuda12]"
When using classifiers that download from Hugging Face (such as Aegis and InstructionDataGuard), set your HF_TOKEN environment variable to avoid rate limiting:
export HF_TOKEN="your_token_here"
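Before starting a long run, it can help to confirm from Python that the token is visible to the process. The following is a minimal sketch using only the standard library; the helper name hf_token_is_set is hypothetical and is not part of NeMo Curator.

```python
import os


def hf_token_is_set(env=None):
    """Return True when a non-empty HF_TOKEN is present in the given mapping."""
    # Default to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    return bool(env.get("HF_TOKEN", "").strip())


if not hf_token_is_set():
    print("HF_TOKEN is not set; downloads from Hugging Face may be rate limited.")
```

Running a check like this at the top of a script surfaces a missing token before any model download begins.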
NVIDIA NeMo Curator provides a base class DistributedDataClassifier that can be extended to fit your specific model. The only requirement is that the model can fit on a single GPU. This module operates on the GPU and works within the pipeline framework using DocumentBatch processing.
The Domain Classifier categorizes English text documents into specific domains or subject areas.
The Multilingual Domain Classifier is functionally similar to the Domain Classifier but supports 52 languages.
The Quality Classifier assesses document quality using the NVIDIA Quality Classifier DeBERTa model.
The exact label categories returned by the Quality Classifier depend on the model configuration. Check the prediction column in your results to see the available labels for filtering with the filter_by parameter.
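One way to see which labels are present is to count the values in the prediction column after a run. The sketch below uses plain Python records in place of real pipeline output; the column name quality_pred and the sample label values are assumptions for illustration, so substitute whatever your results actually contain.

```python
from collections import Counter

# Hypothetical classifier output: one record per document.
# The column name "quality_pred" and the label values are assumptions.
records = [
    {"text": "doc a", "quality_pred": "High"},
    {"text": "doc b", "quality_pred": "Low"},
    {"text": "doc c", "quality_pred": "High"},
]

# Count how often each label appears in the prediction column.
label_counts = Counter(r["quality_pred"] for r in records)
for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

The labels printed here are the candidates you can pass to filter_by.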
The AEGIS classifier detects unsafe content across 13 critical risk categories. It requires a Hugging Face token for access to Llama Guard.
The classifier adds a column with labels: “safe,” “O1” through “O13” (each representing specific safety risks), or “unknown.”
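The label set described above can be handled downstream with a small filter. This is a minimal sketch on plain Python records, not NeMo Curator code; the column name aegis_pred and the helper keep_safe are assumptions for illustration.

```python
# Label set described above: "safe", "O1" through "O13", and "unknown".
AEGIS_LABELS = {"safe", "unknown"} | {f"O{i}" for i in range(1, 14)}


def keep_safe(records, label_field="aegis_pred"):
    """Keep only records labeled "safe"; reject labels outside the documented set."""
    kept = []
    for record in records:
        label = record[label_field]
        if label not in AEGIS_LABELS:
            raise ValueError(f"unexpected AEGIS label: {label!r}")
        if label == "safe":
            kept.append(record)
    return kept


# Hypothetical classified documents.
docs = [
    {"text": "a recipe for bread", "aegis_pred": "safe"},
    {"text": "...", "aegis_pred": "O3"},
    {"text": "...", "aegis_pred": "unknown"},
]
safe_docs = keep_safe(docs)
print(len(safe_docs), "of", len(docs), "documents kept")
```

Whether to keep or drop "unknown" documents is a policy choice; this sketch drops them.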
The AEGIS classifier relies on the LlamaGuard-7b base model, which is a generative LLM. This makes it significantly slower than the other classifiers in NeMo Curator that use encoder-based models (such as DeBERTa). Full GPU utilization is confirmed when running AEGIS on multi-GPU setups, but expect longer processing times compared to non-generative classifiers due to the autoregressive nature of the underlying model.
For raw LLM output, use:
The Instruction Data Guard classifier detects LLM poisoning attacks in instruction-response datasets. It requires Hugging Face token access.
The output includes two columns: a float score instruction_data_guard_poisoning_score and a Boolean is_poisoned.
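With those two columns available, a downstream step can drop flagged rows and review the rest by score. The sketch below works on plain Python records; the sample scores and the instruction field are made up for illustration, and only the two output column names come from the description above.

```python
# Hypothetical classifier output; scores are illustrative, not real results.
rows = [
    {"instruction": "...", "instruction_data_guard_poisoning_score": 0.02, "is_poisoned": False},
    {"instruction": "...", "instruction_data_guard_poisoning_score": 0.97, "is_poisoned": True},
    {"instruction": "...", "instruction_data_guard_poisoning_score": 0.10, "is_poisoned": False},
]

# Keep rows the classifier did not flag.
clean_rows = [r for r in rows if not r["is_poisoned"]]

# Sort flagged rows by score, highest first, for manual review.
flagged_rows = sorted(
    (r for r in rows if r["is_poisoned"]),
    key=lambda r: r["instruction_data_guard_poisoning_score"],
    reverse=True,
)
print(len(clean_rows), "clean,", len(flagged_rows), "flagged")
```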
The FineWeb Edu Classifier scores documents on educational value from 0 to 5. This helps prioritize content for knowledge-intensive tasks.
Similar to the FineWeb Edu Classifier but trained with different annotation sources:
Both provide a quality label column marking scores above 2.5 as “high_quality”:
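The labeling rule can be sketched as a simple threshold on the float score. This is an illustration of the rule as described, not the library's implementation: the "low_quality" label for the remaining documents and the treatment of a score of exactly 2.5 are assumptions.

```python
def quality_label(score, threshold=2.5):
    """Map a float educational-value score to a quality label.

    Scores strictly above the threshold are "high_quality". The
    "low_quality" name and the strict comparison are assumptions.
    """
    return "high_quality" if score > threshold else "low_quality"


for score in (0.4, 2.5, 3.8):
    print(score, quality_label(score))
```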
The Content Type Classifier categorizes documents into 11 distinct speech types.
Classifies prompts by task type and complexity dimensions.
Many of NeMo Curator’s text classifiers use the same underlying tokenizer. When you run multiple classifiers in a single pipeline, you can tokenize once and reuse the tokens for all compatible classifiers. This avoids redundant tokenization and speeds up your pipeline.
Classifiers that share the same base tokenizer can reuse each other’s tokens:
Classifiers from different tokenizer groups are not compatible. You cannot reuse tokens generated for a DeBERTa-based classifier with an AEGIS classifier, or the reverse.
When using use_existing_tokens=True with AegisClassifier, you must also specify the aegis_prompt_field parameter. This field tells the classifier which column contains the pre-formatted prompt text. Omitting it raises a ValueError at pipeline construction time.
Each classifier exposes two parameters for token sharing:
keep_tokens (default: False): When True, the classifier preserves the input_ids and attention_mask columns in its output instead of dropping them. Set this on the first classifier in the pipeline so downstream classifiers can reuse the tokens.
use_existing_tokens (default: False): When True, the classifier skips its internal tokenization step and uses the input_ids and attention_mask columns already present in the data. Set this on all subsequent classifiers.
The following pipeline tokenizes the input text once with the DomainClassifier, then passes those tokens to the QualityClassifier and ContentTypeClassifier:
Set keep_tokens=True on every classifier in the chain except the last one. The last classifier can use keep_tokens=False (the default) to drop the token columns from the final output.
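The rule for assigning the two flags along a chain can be sketched as a small helper. It does not import NeMo Curator; it only computes which values of keep_tokens and use_existing_tokens each position in the chain would receive. The function name token_sharing_flags is hypothetical, and the sketch assumes classifier names in the chain are unique.

```python
def token_sharing_flags(classifier_names):
    """Return the keep_tokens / use_existing_tokens setting for each classifier.

    The first classifier tokenizes (use_existing_tokens=False), every later
    one reuses tokens, and all but the last keep the token columns.
    """
    last = len(classifier_names) - 1
    return {
        name: {
            "use_existing_tokens": i > 0,
            "keep_tokens": i < last,
        }
        for i, name in enumerate(classifier_names)
    }


flags = token_sharing_flags(
    ["DomainClassifier", "QualityClassifier", "ContentTypeClassifier"]
)
for name, kwargs in flags.items():
    print(name, kwargs)
```

Each resulting dictionary could then be unpacked as keyword arguments when constructing the corresponding classifier, provided the classifiers belong to the same tokenizer group.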
You can integrate your own classification models by extending DistributedDataClassifier. Refer to the Text Classifiers README for implementation details and examples.
NVIDIA NeMo Curator’s distributed classifiers are optimized for high-throughput processing. One of the key features is length-based sorting of inputs, controlled by sort_by_length=True (the default).
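To illustrate why sorting by length can help, the sketch below compares how many token slots are processed when each batch is padded to its longest sequence, with and without sorting. This reflects the general technique of length-sorted batching and is not taken from NeMo Curator's implementation; the function name and the sample lengths are made up for illustration.

```python
def padded_tokens(lengths, batch_size):
    """Total token slots when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += max(batch) * len(batch)
    return total


# Hypothetical token counts for eight documents, mixing short and long texts.
lengths = [12, 480, 15, 512, 9, 500, 20, 490]

unsorted_cost = padded_tokens(lengths, batch_size=4)
sorted_cost = padded_tokens(sorted(lengths), batch_size=4)

print("real tokens:      ", sum(lengths))
print("padded, unsorted: ", unsorted_cost)
print("padded, sorted:   ", sorted_cost)
```

In this example the unsorted batches each contain a long document, so every short document is padded to roughly 500 tokens. After sorting, the short documents share one small batch and the padded total drops from 4048 to 2128 token slots.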