API Reference

This section provides API reference documentation for NeMo Curator’s core classes and interfaces.

Core Classes

Pipeline

The main orchestrator for executing sequences of processing stages.

ProcessingStage

Base class for all data processing stages in NeMo Curator.

CompositeStage

High-level stages that decompose into multiple execution stages.
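
The relationship between these three classes can be sketched with stand-in types. The example below is a minimal illustration, not NeMo Curator's API: `ToyBatch`, `ToyStage`, `ToyCompositeStage`, `ToyPipeline`, and the methods `process`, `decompose`, `add_stage`, and `run` are hypothetical names chosen for this sketch. It assumes only what the descriptions above state: a pipeline executes a sequence of stages, and a composite stage expands into several execution stages before anything runs.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a task: a batch of text documents."""
    docs: list[str]


class ToyStage(ABC):
    """Hypothetical stand-in for a processing stage: one batch in, one batch out."""

    @abstractmethod
    def process(self, batch: ToyBatch) -> ToyBatch: ...


class Lowercase(ToyStage):
    def process(self, batch: ToyBatch) -> ToyBatch:
        return ToyBatch([d.lower() for d in batch.docs])


class DropShort(ToyStage):
    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def process(self, batch: ToyBatch) -> ToyBatch:
        # Keep only documents with at least min_chars characters.
        return ToyBatch([d for d in batch.docs if len(d) >= self.min_chars])


class ToyCompositeStage:
    """Hypothetical composite: does no work itself, only expands into stages."""

    def decompose(self) -> list[ToyStage]:
        return [Lowercase(), DropShort(min_chars=5)]


@dataclass
class ToyPipeline:
    """Hypothetical orchestrator: runs its stages in the order they were added."""
    stages: list = field(default_factory=list)

    def add_stage(self, stage) -> "ToyPipeline":
        self.stages.append(stage)
        return self

    def run(self, batch: ToyBatch) -> ToyBatch:
        # Flatten composites into their execution stages first.
        flat: list[ToyStage] = []
        for stage in self.stages:
            if isinstance(stage, ToyCompositeStage):
                flat.extend(stage.decompose())
            else:
                flat.append(stage)
        # Then apply each execution stage in sequence.
        for stage in flat:
            batch = stage.process(batch)
        return batch


result = ToyPipeline().add_stage(ToyCompositeStage()).run(
    ToyBatch(["Hello World", "Hi"])
)
print(result.docs)  # ['hello world']
```

The sketch runs everything in a single process. In NeMo Curator, execution is delegated to an executor, so consult the linked class pages for the actual signatures.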

Task Types

DocumentBatch

Task type for text document processing.

ImageBatch

Task type for image processing.

VideoTask

Task type for video processing.

AudioTask

Task type for audio processing.

Executors

XennaExecutor

Production executor using Cosmos-Xenna for distributed execution.

Experimental Executors

Ray-based experimental executors.

Configuration

Resources

CPU and GPU resource configuration for stages.

Source Code

For complete implementation details, see the NeMo Curator source code on GitHub.

Copyright © 2026, NVIDIA Corporation.