Skip to main content
Ctrl+K
NeMo-Curator - Home NeMo-Curator - Home

NeMo-Curator

NeMo-Curator - Home NeMo-Curator - Home

NeMo-Curator

Table of Contents

  • Home

About NeMo Curator

  • Overview of NeMo Curator
  • Key Features
  • Concepts
    • Image Concepts
      • Data Loading
      • Data Processing
      • Data Export
    • Text Concepts
      • Curation Pipeline
      • Data Loading
      • Data Acquisition
      • Data Processing
      • Data Generation
  • Release Notes 25.07

Get Started

  • About Getting Started
  • Text Curation Quickstart
  • Image Curation Quickstart

Curate Text

  • About Text Curation
  • Tutorials
  • Load Data
    • Common Crawl
    • ArXiv
    • Wikipedia
    • Custom Data
  • Process Data
    • Quality Assessment & Filtering
      • Heuristic Filters
      • Classifier Filters
      • Distributed Classification
      • Custom Filters
    • Deduplication
      • Hash-Based Deduplication
      • Semantic Deduplication
    • Content Processing & Cleaning
      • PII Removal
      • Text Cleaning
    • Language Management
      • Language Identification
      • Stop Words
    • Specialized Processing
      • Code Processing
      • Parallel Text Processing
      • Synthetic Data Detection
      • Task Decontamination
  • Generate Data
    • Services
      • OpenAI
      • NeMo Deploy
      • Reward Models
    • Pipelines
      • Asynchronous
      • Closed Q&A
      • Dialogue
      • Distillation
      • Diverse Q&A
      • Entity Classification
      • Knowledge Extraction
      • Knowledge List
      • Math
      • Open Q&A
      • Python
      • Wikipedia Style Rewrite
      • Writing Task
      • Customizing Prompts
    • Integration with NeMo Curator

Curate Images

  • About Image Curation
  • Tutorials
  • Load Data
    • Webdataset
  • Process Data
    • Classifiers
      • Aesthetic Classifier
      • NSFW Classifier
    • Embeddings
      • TimmImageEmbedder
      • Custom ImageEmbedder
  • Save and Export

Setup & Deployment

  • About Setup & Deployment
  • Install Curator
  • Configure Curator
    • Deployment Environment Configuration
    • Storage & Credentials Configuration
    • Environment Variables Reference
  • Deploy Curator
    • Requirements
    • Running NeMo Curator on Kubernetes
    • Slurm
      • Deploy All Modalities
      • Multi-Node Setup Guide
      • Deploy Text Modality
      • Deploy Image Modality
  • Integrations
    • Spark

Reference

  • About References
  • Infrastructure
    • Distributed Computing Reference
    • Memory Management Guide
    • GPU Processing Guide
    • Resumable Processing
    • Container Environments
  • API Reference
    • datasets
      • datasets.doc_dataset
      • datasets.image_text_pair_dataset
      • datasets.parallel_dataset
    • download
      • download.arxiv
      • download.commoncrawl
      • download.doc_builder
      • download.ja_stopwords
      • download.th_stopwords
      • download.wikipedia
      • download.zh_stopwords
    • filters
      • filters.models
        • filters.models.qe_models
      • filters.bitext_filter
      • filters.classifier_filter
      • filters.code
      • filters.doc_filter
      • filters.heuristic_filter
      • filters.synthetic
    • modifiers
      • modifiers.async_llm_pii_modifier
      • modifiers.c4
      • modifiers.doc_modifier
      • modifiers.fasttext
      • modifiers.line_remover
      • modifiers.llm_pii_modifier
      • modifiers.markdown_remover
      • modifiers.newline_normalizer
      • modifiers.pii_modifier
      • modifiers.quotation_remover
      • modifiers.slicer
      • modifiers.unicode_reformatter
      • modifiers.url_remover
    • modules
      • modules.fuzzy_dedup
        • modules.fuzzy_dedup.bucketstoedges
        • modules.fuzzy_dedup.connectedcomponents
        • modules.fuzzy_dedup.fuzzyduplicates
        • modules.fuzzy_dedup.jaccardsimilarity
        • modules.fuzzy_dedup.lsh
        • modules.fuzzy_dedup.minhash
      • modules.semantic_dedup
        • modules.semantic_dedup.clusteringmodel
        • modules.semantic_dedup.embeddings
        • modules.semantic_dedup.semanticclusterleveldedup
        • modules.semantic_dedup.semdedup
      • modules.add_id
      • modules.base
      • modules.config
      • modules.dataset_ops
      • modules.exact_dedup
      • modules.filter
      • modules.joiner
      • modules.meta
      • modules.modify
      • modules.splitter
      • modules.task
      • modules.to_backend
    • classifiers
      • classifiers.aegis
      • classifiers.base
      • classifiers.content_type
      • classifiers.domain
      • classifiers.fineweb_edu
      • classifiers.prompt_task_complexity
      • classifiers.quality
    • image
      • image.classifiers
        • image.classifiers.aesthetic
        • image.classifiers.base
        • image.classifiers.nsfw
      • image.embedders
        • image.embedders.base
        • image.embedders.timm
    • pii
      • pii.recognizers
        • pii.recognizers.address_recognizer
      • pii.algorithm
      • pii.constants
      • pii.custom_batch_analyzer_engine
      • pii.custom_nlp_engine
    • synthetic
      • synthetic.async_nemotron
      • synthetic.async_nemotron_cc
      • synthetic.error
      • synthetic.generator
      • synthetic.mixtral
      • synthetic.nemotron
      • synthetic.nemotron_cc
      • synthetic.no_format
      • synthetic.prompts
    • services
      • services.conversation_formatter
      • services.model_client
      • services.nemo_client
      • services.openai_client
    • nemo_run
      • nemo_run.slurm
    • tasks
      • tasks.downstream_task
      • tasks.metrics
    • utils
      • utils.fuzzy_dedup_utils
        • utils.fuzzy_dedup_utils.id_mapping
        • utils.fuzzy_dedup_utils.io_utils
        • utils.fuzzy_dedup_utils.merge_utils
        • utils.fuzzy_dedup_utils.output_map_utils
        • utils.fuzzy_dedup_utils.shuffle_utils
      • utils.image
        • utils.image.transforms
      • utils.aegis_utils
      • utils.config_utils
      • utils.constants
      • utils.decorators
      • utils.distributed_utils
      • utils.download_utils
      • utils.duplicates_removal
      • utils.file_utils
      • utils.gpu_utils
      • utils.import_utils
      • utils.llm_pii_utils
      • utils.module_utils
      • utils.script_utils
      • utils.semdedup_utils
      • utils.text_utils
  • Tools
  • API Reference
  • download

download#

Submodules#

  • download.arxiv
  • download.commoncrawl
  • download.doc_builder
  • download.ja_stopwords
  • download.th_stopwords
  • download.wikipedia
  • download.zh_stopwords

previous

datasets.parallel_dataset

next

download.arxiv

On this page
  • Submodules
NVIDIA NVIDIA
Privacy Policy | Manage My Privacy | Do Not Sell or Share My Data | Terms of Service | Accessibility | Corporate Policies | Product Security | Contact

Copyright © 2025 NVIDIA Corporation.