NeMo Curator Release Notes: 25.09#

This major release represents a fundamental architecture shift from Dask to Ray, expanding NeMo Curator to support multimodal data curation with new video and audio capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.

Installation Updates#

  • New Docker container: Updated Docker infrastructure with a CUDA 12.8.1 and Ubuntu 24.04 base image, available from the NGC Catalog (nvcr.io/nvidia/nemo-curator:25.09)

  • Dockerfile for custom images: Simplified Dockerfile structure for building custom containers with FFmpeg support

  • UV source installations: Integrated UV package manager (v0.8.22) for faster dependency management

  • PyPI improvements: Enhanced PyPI installation with modular extras for targeted functionality:

    Table 2 Available Installation Extras#

    | Extra          | Installation Command       | Description                                                |
    |----------------|----------------------------|------------------------------------------------------------|
    | All Modalities | nemo-curator[all]          | Complete installation with all modalities and GPU support  |
    | Text Curation  | nemo-curator[text_cuda12]  | GPU-accelerated text processing with RAPIDS                |
    | Image Curation | nemo-curator[image_cuda12] | Image processing with NVIDIA DALI                          |
    | Audio Curation | nemo-curator[audio_cuda12] | Speech recognition with NeMo ASR models                    |
    | Video Curation | nemo-curator[video_cuda12] | Video processing with GPU acceleration                     |
    | Basic GPU      | nemo-curator[cuda12]       | CUDA utilities without modality-specific dependencies      |

    All GPU installations require the NVIDIA PyPI index:

    uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[EXTRA]"
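
    For example, to install the GPU-accelerated text curation extra from the table above (the same pattern applies to the other extras):

        uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[text_cuda12]"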
    

New Modalities#

Video#

NeMo Curator now supports comprehensive video data curation with distributed processing capabilities.

Audio#

This release adds new audio curation capabilities for speech data processing.

Modality Refactors#

Text#

  • Ray backend migration: Complete transition from Dask to Ray for distributed text processing

  • Improved model-based classifier throughput: Length-based sequence sorting improves GPU memory utilization and better overlaps tokenization with model inference (see the sketch after this list)

  • Task-centric architecture: New Task-based processing model for finer-grained control

  • Pipeline redesign: Updated ProcessingStage and Pipeline architecture with resource specification
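
The sketch below illustrates the idea behind length-based sequence sorting for classifier inference. It is a conceptual example built on Hugging Face transformers, not NeMo Curator's internal implementation; the function name and batching details are illustrative only.

    # Conceptual sketch: sort documents by token length before batching so that each
    # batch contains similarly sized sequences. This minimizes padding, which keeps
    # GPU memory utilization predictable and leaves more room to overlap tokenization
    # with model inference. Not NeMo Curator's internal API.
    from transformers import AutoTokenizer

    def batches_sorted_by_length(texts, tokenizer, batch_size=32):
        # Compute token lengths once, without padding.
        lengths = [len(tokenizer.encode(t)) for t in texts]
        # Order document indices from shortest to longest.
        order = sorted(range(len(texts)), key=lambda i: lengths[i])
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            # Pad only to the longest sequence in this batch, not the global maximum.
            yield idx, tokenizer([texts[i] for i in idx], padding="longest",
                                 truncation=True, return_tensors="pt")

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    docs = ["short text", "a much longer document " * 20, "medium length example text here"]
    for idx, batch in batches_sorted_by_length(docs, tokenizer, batch_size=2):
        print(idx, batch["input_ids"].shape)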

Image#

  • Pipeline-based architecture: Transitioned from legacy ImageTextPairDataset to modern stage-based processing with ImageReaderStage, ImageEmbeddingStage, and filter stages

  • DALI-based image loading: New ImageReaderStage uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback

  • Modular processing stages: Separate stages for embedding generation, aesthetic filtering, and NSFW filtering

  • Task-based data flow: Images processed as ImageBatch tasks containing ImageObject instances with metadata, embeddings, and classification scores

Learn more about image curation.
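
A minimal sketch of how these stages might compose into a pipeline is shown below. The stage and pipeline class names follow the release notes, but the import paths, constructor arguments, and method names are assumptions; refer to the image curation documentation for the exact API.

    # Hypothetical sketch of a stage-based image curation pipeline. Import paths,
    # constructor arguments, and method names are assumptions for illustration only.
    from nemo_curator.pipeline import Pipeline                 # assumed path
    from nemo_curator.stages.image import (                    # assumed path
        ImageReaderStage,
        ImageEmbeddingStage,
    )

    pipeline = Pipeline(name="image_curation")
    # Read WebDataset tar shards with DALI; images flow downstream as ImageBatch
    # tasks containing ImageObject instances.
    pipeline.add_stage(ImageReaderStage(input_path="/data/image_shards"))
    # Generate embeddings used by the aesthetic and NSFW filter stages, which would
    # be appended here in the same way.
    pipeline.add_stage(ImageEmbeddingStage(model_dir="/models/clip"))

    pipeline.run()  # executor/backend selection is covered under Core Refactors below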

Deduplication Improvements#

Enhanced deduplication capabilities across all modalities with improved performance and flexibility:

  • Exact and Fuzzy deduplication: Updated rapidsmpf-based shuffle backend for more efficient GPU-to-GPU data transfer and better spilling capabilities

  • Semantic deduplication: Support for deduplicating text and video datasets using unified embedding-based workflows

  • New ranking strategies: Added RankingStrategy, which ranks the members of each cluster to decide which points to keep during duplicate removal; metadata-based ranking can prioritize specific datasets or inputs (see the sketch after this list)
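
The sketch below illustrates the idea behind embedding-based semantic deduplication with a ranking step: cluster the embeddings, then visit each cluster's members in priority order and drop near-duplicates of higher-ranked items. It is a conceptual example built on scikit-learn, not NeMo Curator's RankingStrategy API; the scores and threshold are illustrative.

    # Conceptual illustration of semantic deduplication with a ranking step
    # (plain numpy/scikit-learn, not NeMo Curator's RankingStrategy API).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 256))   # stand-in for model embeddings
    priority = rng.random(1000)                 # e.g. a metadata-derived ranking score

    labels = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(embeddings)

    removed = set()
    for cluster_id in np.unique(labels):
        members = np.where(labels == cluster_id)[0]
        # Ranking step: visit members from highest to lowest priority so that
        # higher-ranked items are kept and their near-duplicates are dropped.
        ranked = members[np.argsort(-priority[members])]
        sims = cosine_similarity(embeddings[ranked])
        for i in range(len(ranked)):
            if ranked[i] in removed:
                continue
            for j in range(i + 1, len(ranked)):
                if sims[i, j] > 0.9:            # similarity threshold for "duplicate"
                    removed.add(ranked[j])

    print(f"removed {len(removed)} of {len(embeddings)} items as semantic duplicates")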

Core Refactors#

The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:

    graph LR
    subgraph "User Layer"
        P[Pipeline]
        S1[ProcessingStage X→Y]
        S2[ProcessingStage Y→Z]
        S3[ProcessingStage Z→W]
        R[Resources<br/>CPU/GPU/NVDEC/NVENC]
    end
    
    subgraph "Orchestration Layer"
        BE[BaseExecutor Interface]
    end
    
    subgraph "Backend Layer"
        XE[XennaExecutor<br/>Production Ready]
        RAP[RayActorPoolExecutor<br/>Experimental]
        RDE[RayDataExecutor<br/>Experimental]
    end
    
    subgraph "Adaptation Layer"
        XA[Xenna Adapter]
        RAPA[Ray Actor Adapter]
        RDA[Ray Data Adapter]
    end
    
    subgraph "Execution Layer"
        X[Cosmos-Xenna<br/>Streaming/Batch]
        RAY1[Ray Actor Pool<br/>Load Balancing]
        RAY2[Ray Data API<br/>Dataset Processing]
    end
    
    P --> S1
    P --> S2
    P --> S3
    S1 -.-> R
    S2 -.-> R
    S3 -.-> R
    
    P --> BE
    BE --> XE
    BE --> RAP
    BE --> RDE
    
    XE --> XA
    RAP --> RAPA
    RDE --> RDA
    
    XA --> X
    RAPA --> RAY1
    RDA --> RAY2
    
    style XE fill:#90EE90
    style RAP fill:#FFE4B5
    style RDE fill:#FFE4B5
    style P fill:#E6F3FF
    style BE fill:#F0F8FF
    

Pipelines#

  • New Pipeline API: Ray-based pipeline execution with BaseExecutor interface

  • Multiple backends: Support for Xenna, Ray Actor Pool, and Ray Data execution backends

  • Resource specification: Configurable CPU and GPU memory requirements per stage

  • Stage composition: Improved stage validation and execution orchestration
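
A minimal sketch of composing a pipeline and choosing an execution backend follows. The Pipeline, BaseExecutor, and executor class names come from the release notes, but the import paths, constructor arguments, and method names are assumptions for illustration.

    # Hypothetical sketch: compose ProcessingStage instances into a Pipeline and run
    # it on a chosen backend. Import paths and method names are assumptions.
    from nemo_curator.pipeline import Pipeline              # assumed path
    from nemo_curator.backends.xenna import XennaExecutor   # assumed path

    def build_and_run(stages):
        pipeline = Pipeline(name="curation_pipeline")
        for stage in stages:
            pipeline.add_stage(stage)   # stages are validated and composed in order
        # XennaExecutor is the production-ready backend; RayActorPoolExecutor and
        # RayDataExecutor are experimental alternatives behind the same BaseExecutor
        # interface, so swapping backends does not change the pipeline definition.
        return pipeline.run(XennaExecutor())

    # Example (stage construction is shown in the Stages sketch below):
    # build_and_run([reader_stage, filter_stage, writer_stage])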

Stages#

  • ProcessingStage redesign: Generic ProcessingStage[X, Y] base class with type safety

  • Resource requirements: Built-in resource specification for CPU and GPU memory

  • Backend adapters: Stage adaptation layer for different Ray orchestration systems

  • Input/output validation: Enhanced type checking and data validation
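
The sketch below shows what a typed stage with built-in resource requirements might look like. The ProcessingStage[X, Y] generic and the Resources specification come from the release notes; the base-class interface, import paths, and the DocumentBatch task type used here are assumptions for illustration.

    # Hypothetical sketch of a custom ProcessingStage[X, Y] with per-stage resources.
    # The exact base-class interface and task types are assumptions, not the real API.
    from nemo_curator.stages.base import ProcessingStage, Resources   # assumed path
    from nemo_curator.tasks import DocumentBatch                      # assumed task type

    class MinLengthFilterStage(ProcessingStage[DocumentBatch, DocumentBatch]):
        """Drop documents shorter than `min_chars` characters (illustrative only)."""

        def __init__(self, min_chars: int = 200):
            self.min_chars = min_chars
            # Declare what this stage needs; the executor uses it for scheduling.
            self.resources = Resources(cpus=1.0)   # CPU-only stage in this example

        def process(self, batch: DocumentBatch) -> DocumentBatch:
            # Input and output types are both DocumentBatch, matching the
            # ProcessingStage[DocumentBatch, DocumentBatch] declaration above.
            df = batch.data   # assumed accessor for the batch's dataframe
            return DocumentBatch(data=df[df["text"].str.len() >= self.min_chars])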

Tutorials#

For all tutorial content, refer to the tutorials directory in the NeMo Curator GitHub repository.

Known Limitations#

The following features are pending refactor in a future release:

Generation#

  • Synthetic data generation: Synthetic text generation features are being refactored for Ray compatibility

  • Hard negative mining: Retrieval-based data generation workflows under development

PII#

  • PII processing: Personally Identifiable Information (PII) removal tools are being updated for the Ray backend

  • Privacy workflows: Enhanced privacy-preserving data curation capabilities in development

Blending & Shuffling#

  • Data blending: Multi-source dataset blending functionality being refactored

  • Dataset shuffling: Large-scale data shuffling operations under development

Docs Refactor#

  • Local preview capability: Improved documentation build system with local preview support

  • Modality-specific guides: Comprehensive documentation for each supported modality (text, image, audio, video)

  • API reference: Complete API documentation with type annotations and examples


What’s Next#

The next release will focus on completing the refactor of Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.