API Reference

NeMo Curator’s API reference documents all modules, classes, and functions. Use it to look up how each component works and to integrate NeMo Curator into your data curation workflows.

Core Data Handling

Datasets & Download

Essential classes for loading, managing, and downloading training data from various sources.

doc-dataset parallel-dataset arxiv commoncrawl

datasets

Data Processing

Filters & Modifiers

Tools for cleaning, filtering, and transforming text data to improve quality and remove unwanted content.

classifier-filter heuristic-filter pii-modifier

filters

Classification & Analysis

AI-Powered Analysis

Advanced classification tools and image processing capabilities for content analysis and quality assessment.

aegis content-type domain-classifier

classifiers

Privacy & Security

PII Detection & Redaction

Identify and redact personally identifiable information (PII) in datasets using configurable recognizers and detection algorithms.

recognizers algorithms redaction

pii

Synthetic Data

Data Generation

Create high-quality synthetic training data using advanced language models and generation techniques.

generator nemotron mixtral

synthetic

Advanced Processing

Deduplication & Modules

Processing modules, including semantic deduplication, fuzzy matching, and other data pipeline components.

semantic-dedup fuzzy-dedup add-id

modules