API Reference
NeMo Curator’s API reference documents its modules, classes, and functions. Use these references to look up the interfaces behind NeMo Curator and to integrate it into your data curation workflows.
Datasets & Download
Essential classes for loading, managing, and downloading training data from various sources.
doc-dataset parallel-dataset arxiv commoncrawl
Filters & Modifiers
Tools for cleaning, filtering, and transforming text data to improve quality and remove unwanted content.
classifier-filter heuristic-filter pii-modifier
AI-Powered Analysis
Classification tools and image processing capabilities for content analysis and quality assessment.
aegis content-type domain-classifier
PII Detection & Redaction
Identify and redact personally identifiable information (PII) in datasets using recognizers and detection algorithms.
recognizers algorithms redaction
Data Generation
Create synthetic training data using language models and generation techniques.
generator nemotron mixtral
Deduplication & Modules
Advanced processing modules including semantic deduplication, fuzzy matching, and data pipeline components.
semantic-dedup fuzzy-dedup add-id
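To make the idea of fuzzy matching concrete, here is a minimal sketch in plain Python. It is not NeMo Curator's API: the function names (shingles, jaccard, near_duplicate_pairs), the 3-word shingle size, and the 0.5 threshold are all assumptions chosen for illustration. It compares every pair of documents with exact Jaccard similarity, which only works for small collections; fuzzy deduplication at scale typically approximates this with MinHash and locality-sensitive hashing instead.

```python
from itertools import combinations


def shingles(text, n=3):
    """Return the set of n-word shingles in a lowercased text (hypothetical helper)."""
    words = text.lower().split()
    if len(words) < n:
        # Texts shorter than n words become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5):
    """Return ID pairs whose shingle sets have Jaccard similarity >= threshold.

    docs maps a document ID to its text. The threshold is an illustrative
    assumption, not a recommended setting.
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    return [
        (x, y)
        for x, y in combinations(sets, 2)
        if jaccard(sets[x], sets[y]) >= threshold
    ]


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy cat",
        "c": "completely unrelated sentence about data curation pipelines",
    }
    print(near_duplicate_pairs(docs))
```

Documents "a" and "b" differ by one word and share six of eight distinct shingles, so they are flagged as a pair, while "c" shares none with either.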