References#
NeMo Curator’s reference documentation covers technical details, API references, and integration information. Use these resources to understand how NeMo Curator works and to integrate it with other tools and systems.
API Quicklinks#
Use these links to jump directly to the API documentation for each major NeMo Curator module.
Core classifier base classes and interfaces.
APIs for document and parallel datasets.
Deduplication and semantic deduplication tools.
APIs for downloading and building datasets from external sources.
Filtering and quality control APIs.
Image processing, classifiers, and embedders.
Text and data modification utilities.
PII detection, recognizers, and redaction tools.
Service clients and integrations.
NeMo Run integration for distributed execution.
Synthetic data generation modules.
Task definitions and metrics.
General utility functions and helpers.
Infrastructure Components#
Explore the foundational infrastructure that powers NeMo Curator. Learn how to scale, optimize, and manage large data workflows efficiently.
Configure and manage distributed processing across multiple machines.
Optimize memory usage when processing large datasets.
Leverage NVIDIA GPUs for faster data processing.
Continue interrupted operations across large datasets.
Integration & Tools#
Discover related tools and integrations in the NVIDIA AI ecosystem that complement NeMo Curator, enabling seamless workflows from data curation to model training and deployment.
Learn about complementary tools in the NVIDIA ecosystem.