This major release moves NeMo Curator's architecture from Dask to Ray and expands it to multimodal data curation, adding new video and audio capabilities. The refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.
Migrating from a previous version of NeMo Curator? Refer to the Migration Guide for step-by-step instructions and the Migration FAQ for common questions.
Installation Updates
New Docker container: Updated Docker infrastructure with CUDA 12.8.1 and an Ubuntu 24.04 base, available from the NGC Catalog (nvcr.io/nvidia/nemo-curator:25.09)
Dockerfile for building your own image: Simplified Dockerfile structure for custom container builds with FFmpeg support
Text integration: Audio pipelines connect to text curation workflows via AudioToDocumentStage
Manifest support: JSONL manifest format for audio file management
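To make the JSONL manifest format mentioned above concrete, here is a minimal sketch of writing and reading a manifest with one JSON object per line. The field names (`audio_filepath`, `duration`, `text`) and the helper functions are assumptions chosen for illustration; these notes do not specify the manifest schema, so check the audio documentation for the fields NeMo Curator actually expects.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(entries, path):
    """Write one JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry) + "\n")


def read_manifest(path):
    """Read a JSONL manifest, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical field names, chosen for illustration only.
entries = [
    {"audio_filepath": "clips/a.wav", "duration": 3.2, "text": "hello world"},
    {"audio_filepath": "clips/b.wav", "duration": 1.5, "text": "good morning"},
]

manifest_path = Path(tempfile.mkdtemp()) / "manifest.jsonl"
write_manifest(entries, manifest_path)
loaded = read_manifest(manifest_path)
```

Because each line is an independent JSON object, a manifest can be split, concatenated, or streamed line by line without parsing the whole file.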
Modality Refactors
Text
Ray backend migration: Complete transition from Dask to Ray for distributed text processing
Improved model-based classifier throughput: Overlaps tokenization and inference compute more effectively, using length-based sequence sorting to improve GPU memory utilization
Task-centric architecture: New Task-based processing model for finer-grained control
Pipeline redesign: Updated ProcessingStage and Pipeline architecture with resource specification
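The length-based sequence sorting mentioned in the Text notes above can be illustrated with a small, library-free sketch. It is not NeMo Curator's implementation; the function names and toy inputs are hypothetical. The idea is that batching sequences of similar length reduces the padding each batch needs, which lowers wasted memory and compute during inference.

```python
def batch_by_length(token_ids, batch_size):
    """Group sequences into batches of similar length.

    Returns a list of batches; each batch is a list of (original_index, tokens),
    so results can be restored to input order after inference.
    """
    order = sorted(range(len(token_ids)), key=lambda i: len(token_ids[i]))
    return [
        [(i, token_ids[i]) for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]


def padded_size(batches):
    """Total token slots after padding each batch to its longest sequence."""
    return sum(len(b) * max(len(seq) for _, seq in b) for b in batches)


# Toy "tokenized" inputs: lists of token ids with very different lengths.
sequences = [[1] * 2, [1] * 50, [1] * 3, [1] * 48]

# Baseline: batch in arrival order, two sequences per batch.
unsorted_batches = [
    [(i, sequences[i]) for i in range(start, min(start + 2, len(sequences)))]
    for start in range(0, len(sequences), 2)
]
sorted_batches = batch_by_length(sequences, batch_size=2)

print(padded_size(unsorted_batches))  # 196 padded token slots
print(padded_size(sorted_batches))    # 106 padded token slots
```

In this toy case, sorting by length cuts the padded footprint from 196 to 106 token slots for the same four inputs.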
Image
Pipeline-based architecture: Transitioned from legacy ImageTextPairDataset to modern stage-based processing with ImageReaderStage, ImageEmbeddingStage, and filter stages
DALI-based image loading: New ImageReaderStage uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
Enhanced deduplication capabilities across all modalities with improved performance and flexibility:
Exact and Fuzzy deduplication: Updated rapidsmpf-based shuffle backend for more efficient GPU-to-GPU data transfer and better spilling capabilities
Semantic deduplication: Support for deduplicating text and video datasets using unified embedding-based workflows
New ranking strategies: Added RankingStrategy, which ranks the elements within each cluster to decide which point to prioritize during duplicate removal. It supports metadata-based ranking, so you can prioritize specific datasets or inputs
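As a rough illustration of ranking within clusters, the sketch below keeps the best-ranked record per cluster and marks the rest as duplicates. It does not use the RankingStrategy API; the record fields, `SOURCE_PRIORITY`, and `select_representatives` are hypothetical. It also simplifies by treating every cluster member as a duplicate of the kept record, whereas a real semantic deduplication workflow would typically also apply a similarity threshold.

```python
from collections import defaultdict


def select_representatives(records, rank_key):
    """Keep the best-ranked record in each cluster; mark the rest as duplicates.

    `rank_key` maps a record to a sortable value; smaller values rank first.
    """
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["cluster"]].append(rec)

    keep, drop = [], []
    for members in clusters.values():
        ranked = sorted(members, key=rank_key)
        keep.append(ranked[0]["id"])
        drop.extend(r["id"] for r in ranked[1:])
    return sorted(keep), sorted(drop)


# Hypothetical metadata: prefer the "curated" source, then closeness to the centroid.
SOURCE_PRIORITY = {"curated": 0, "web": 1}

records = [
    {"id": "a", "cluster": 0, "source": "web", "dist_to_centroid": 0.10},
    {"id": "b", "cluster": 0, "source": "curated", "dist_to_centroid": 0.30},
    {"id": "c", "cluster": 1, "source": "web", "dist_to_centroid": 0.05},
    {"id": "d", "cluster": 1, "source": "web", "dist_to_centroid": 0.20},
]

keep, drop = select_representatives(
    records,
    rank_key=lambda r: (SOURCE_PRIORITY[r["source"]], r["dist_to_centroid"]),
)
print(keep)  # ['b', 'c']
print(drop)  # ['a', 'd']
```

Record `b` is kept over `a` even though `a` is closer to the centroid, because the metadata-based priority is compared first.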
Core Refactors
The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:
Pipelines
New Pipeline API: Ray-based pipeline execution with BaseExecutor interface
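The separation between a pipeline definition and its execution backend can be sketched in plain Python. Every class below is a hypothetical stand-in, not the NeMo Curator API: the point is only that stages describe the work, while an executor interface decides how that work is run, so a Ray-backed executor could replace a local one without changing the pipeline.

```python
from abc import ABC, abstractmethod


class Stage(ABC):
    """One step of work; hypothetical stand-in, not a NeMo Curator class."""

    @abstractmethod
    def process(self, task):
        ...


class StripWhitespace(Stage):
    def process(self, task):
        return task.strip()


class Lowercase(Stage):
    def process(self, task):
        return task.lower()


class Executor(ABC):
    """Backend interface: decides how stages are run over tasks."""

    @abstractmethod
    def execute(self, stages, tasks):
        ...


class SequentialExecutor(Executor):
    """Runs every stage in order, in-process. A distributed executor could
    implement the same interface and spread the work across workers."""

    def execute(self, stages, tasks):
        for stage in stages:
            tasks = [stage.process(t) for t in tasks]
        return tasks


class Pipeline:
    """Ordered list of stages; execution is delegated to an executor."""

    def __init__(self, stages):
        self.stages = list(stages)

    def run(self, tasks, executor):
        return executor.execute(self.stages, tasks)


pipeline = Pipeline([StripWhitespace(), Lowercase()])
result = pipeline.run(["  Hello ", "WORLD"], SequentialExecutor())
print(result)  # ['hello', 'world']
```

For the real stage and executor classes and their signatures, refer to the API reference and the tutorials.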
For all tutorial content, refer to the tutorials directory in the NeMo Curator GitHub repository.
Known Limitations
(Pending Refactor in Future Release)
Generation
Synthetic data generation: Synthetic text generation features are being refactored for Ray compatibility
Hard negative mining: Retrieval-based data generation workflows under development
PII
PII processing: Personally Identifiable Information (PII) removal tools are being updated for the Ray backend
Privacy workflows: Enhanced privacy-preserving data curation capabilities in development
Blending & Shuffling
Data blending: Multi-source dataset blending functionality is being refactored
Dataset shuffling: Large-scale data shuffling operations under development
Docs Refactor
Local preview capability: Improved documentation build system with local preview support
Modality-specific guides: Comprehensive documentation for each supported modality (text, image, audio, video)
API reference: Complete API documentation with type annotations and examples
What’s Next
The next release will focus on completing the refactor of Synthetic Data Generation, PII, and Blending & Shuffling features, along with additional performance optimizations and new modality support.