References

NeMo Curator’s reference documentation covers technical details, API references, and integration information. Use these resources to understand how NeMo Curator works and to integrate it with other tools and systems.

API Quicklinks

Use these links to jump directly to the API documentation for each major NeMo Curator module.

Classifiers API

Core classifier base classes and interfaces.

classification aegis prompt-task-complexity

../apidocs/classifiers/classifiers.html

Datasets API

APIs for document and parallel datasets.

doc-dataset parallel-dataset image-text-pair

../apidocs/datasets/datasets.html

Deduplication API

Deduplication and semantic deduplication tools.

semantic-dedup fuzzy-dedup

../apidocs/modules/modules.semantic_dedup.html

Download API

APIs for downloading and building datasets from external sources.

arxiv commoncrawl wikipedia

../apidocs/download/download.html

Filters API

Filtering and quality control APIs.

classifier-filter heuristic-filter

../apidocs/filters/filters.html

Image API

Image processing, classifiers, and embedders.

classifiers embedders

../apidocs/image/image.html

Modifiers API

Text and data modification utilities.

markdown-remover url-remover

../apidocs/modifiers/modifiers.html

Services API

Service clients and integrations.

model-client openai-client

../apidocs/services/services.html

NeMo Run API

NeMo Run integration for distributed execution.

distributed

../apidocs/nemo_run/nemo_run.html

Tasks API

Task definitions and metrics.

metrics downstream-task

../apidocs/tasks/tasks.html

Utils API

General utility functions and helpers.

text-utils distributed-utils

../apidocs/utils/utils.html

Infrastructure Components

Explore the infrastructure behind NeMo Curator and learn how to scale, optimize, and manage large data workflows.

Memory Management

Optimize memory usage when processing large datasets.

partitioning batching monitoring

Memory Management Guide

GPU Acceleration

Use NVIDIA GPUs for faster data processing.

cuda rmm performance

GPU Processing Guide

Resumable Processing

Resume interrupted operations on large datasets.

checkpoints recovery batching

Resumable Processing

Integration & Tools

Discover related tools and integrations in the NVIDIA AI ecosystem that complement NeMo Curator across the workflow from data curation to model training and deployment.

Related Tools

Learn about complementary tools in the NVIDIA ecosystem.

nemo-framework triton-server tao-toolkit

NVIDIA AI Ecosystem: Related Tools