NVIDIA AI Ecosystem: Related Tools

After preparing your data with NeMo Curator, you’ll likely want to use it to train models. NVIDIA provides an integrated ecosystem of AI tools that work seamlessly with the data NeMo Curator produces. This guide outlines those tools and the typical next steps.

NeMo Framework

NVIDIA NeMo is an end-to-end framework for building, training, and fine-tuning GPU-accelerated language models. It provides:

  • Pretrained model checkpoints

  • Training and inference scripts

  • Optimization techniques for large-scale deployments

Training a Tokenizer

Tokenizers transform text into tokens that language models can interpret. While NeMo Curator doesn’t handle tokenizer training or tokenization in general, NeMo does.

Learn how to train a tokenizer using NeMo in the tokenizer training documentation.
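
To illustrate what tokenizer training involves, here is a minimal sketch using the Hugging Face `tokenizers` library. This is not NeMo’s tokenizer-training workflow; the corpus path, vocabulary size, and special tokens are placeholder assumptions.

```python
# Illustrative only: a minimal BPE tokenizer-training sketch using the Hugging Face
# `tokenizers` library, not NeMo's tokenizer-training workflow. Paths, vocabulary
# size, and special tokens below are placeholder assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Write a tiny stand-in corpus; in practice this would be the text output of your
# NeMo Curator pipeline.
with open("curated_corpus.txt", "w") as f:
    f.write("NeMo Curator prepares large curated datasets for pretraining.\n")

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=32000,                         # assumed vocabulary size
    special_tokens=["<unk>", "<s>", "</s>"],  # assumed special tokens
)
tokenizer.train(files=["curated_corpus.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")

# Sanity check: encode a sample sentence with the trained tokenizer.
print(tokenizer.encode("NeMo Curator prepares curated datasets.").tokens)
```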

Training Large Language Models

Pretraining a large language model involves running next-token prediction over large curated datasets, exactly the kind of data NeMo Curator helps you prepare. NeMo provides the full pretraining workflow for large language models built on your curated data.

Find comprehensive information in the large language model section of the NeMo user guide on:

  • Pretraining methodologies

  • Model evaluation

  • Parameter-efficient fine-tuning (PEFT)

  • Distributed training
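
As a conceptual illustration of the next-token prediction objective described above, the following PyTorch sketch computes the shifted cross-entropy loss on a toy batch. It is not NeMo’s training loop; the tensor shapes and vocabulary size are arbitrary stand-ins.

```python
# Conceptual sketch of the next-token prediction objective used in pretraining;
# this is not NeMo's training loop, just the core loss on a toy batch.
import torch
import torch.nn.functional as F

vocab_size, batch, seq_len = 100, 2, 8
logits = torch.randn(batch, seq_len, vocab_size)         # model outputs (stand-in)
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # token ids from curated data

# Predict token t+1 from positions <= t: shift logits and labels by one.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = tokens[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels)
print(f"next-token prediction loss: {loss.item():.3f}")
```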

NeMo Aligner

NVIDIA NeMo Aligner is a framework designed for aligning language models with human preferences.

After pretraining, alignment is what makes a model usable in a chat-like setting. NeMo Aligner takes the alignment data you curate and applies it to a pretrained language model.

Learn about NeMo Aligner’s capabilities in the NeMo Aligner documentation, including:

  • Reinforcement Learning from Human Feedback (RLHF)

  • Direct Preference Optimization (DPO)

  • Proximal Policy Optimization (PPO)

  • Constitutional AI (CAI)
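
As one concrete example of the preference-optimization objectives listed above, here is a minimal sketch of the Direct Preference Optimization (DPO) loss. It is not NeMo Aligner code; it assumes the summed log-probabilities of chosen and rejected responses under the policy and a frozen reference model are already available.

```python
# Illustrative sketch of the DPO loss, one of the alignment objectives listed above.
# Not NeMo Aligner code; the log-probability inputs below are placeholder values.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Reward margins are log-probability ratios against the frozen reference model.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Maximize the probability that the chosen response out-scores the rejected one.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of summed sequence log-probabilities (placeholder values).
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.8]))
print(f"DPO loss: {loss.item():.3f}")
```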

NVIDIA AI Enterprise

For organizations looking to deploy trained models to production, NVIDIA AI Enterprise provides a software platform that includes enterprise support for:

  • The complete NeMo framework

  • Pretrained foundation models

  • Deployment and inference tools

  • Enterprise-grade security and support

Complete Workflow

A typical end-to-end workflow with NVIDIA’s AI tools includes:

  1. Data Preparation: Use NeMo Curator to clean, filter, and prepare your dataset

  2. Tokenization: Train or use a tokenizer with NeMo

  3. Model Training: Pretrain or fine-tune models with NeMo

  4. Alignment: Align models with human preferences using NeMo Aligner

  5. Deployment: Deploy models using NVIDIA AI Enterprise or Triton Inference Server

This integrated ecosystem allows you to move from raw data to deployed, production-ready models with consistent tooling and optimized performance.
