NeMo Curator Release Notes: 26.02#

What’s New in 26.02#

Stage and Pipeline Benchmarking#

Benchmarking framework for performance monitoring:

  • Stage and Pipeline Benchmarking: Automated benchmarks for curation modalities (text, image, video, audio)

  • Performance Tracking: Metrics tracking across:

    • Text pipelines: exact deduplication, fuzzy deduplication, semantic deduplication, score filters, modifiers

    • Image curation workflows with DALI-based processing

    • Video processing pipelines with splitting, scene detection, captioning, and semantic deduplication

    • Audio ASR inference and quality assessment

YAML Configuration Support#

Declarative pipeline configuration for text curation workflows:

  • YAML-Based Pipelines: Define entire curation pipelines in YAML configuration files

  • Pre-Built Configurations: Ready-to-use configs for common workflows:

    • Code filtering, exact/fuzzy/semantic deduplication

    • Heuristic filtering (English and non-English)

    • FastText language identification

  • Reproducible Workflows: Version-controlled pipeline definitions for consistent results

Example:

python run.py --config-path ./text --config-name heuristic_filter_english_pipeline.yaml input_path=./input_dir output_path=./output_dir

Pipeline Performance and Metric Logging#

Enhanced tracking of pipeline execution:

  • Performance Metrics: Automatic tracking of processing time, throughput, and resource usage

  • Better Debugging: Detailed logs and error reporting for failed stages

Improvements from 25.09#

Video Curation#

  • Model Updates: Removed InternVideo2 dependency; updated to more performant alternatives

  • vLLM 0.15.1: Upgraded for better video captioning compatibility and performance

  • FFmpeg 8.0.1: Latest FFmpeg with improved codec support and performance

  • Enhanced Tutorials: Improved video processing examples with real-world scenarios

Audio Curation#

  • Enhanced Documentation: Comprehensive ASR inference and quality assessment guides

  • Improved WER Filtering: Better guidance for Word Error Rate filtering thresholds

  • Manifest Handling: More robust JSONL manifest processing for large audio datasets

Image Curation#

  • Optimized Batch Sizes: Configurable batch sizes for better CPU/GPU memory usage (batch_size=100, num_threads=16)

  • Memory Guidance: Added troubleshooting documentation for out-of-memory errors

  • Tutorial Improvements: Updated examples optimized for typical GPU configurations

Text Curation#

  • Better Memory Management: Improved handling of large-scale semantic deduplication

Deduplication Enhancements#

  • Cloud Storage Support: Fixed ParquetReader/Writer and pairwise I/O for S3, GCS, and Azure Blob

  • Non-Blocking ID Generation: Improved ID generator performance for large datasets

  • Empty Batch Handling: Better error handling for filters processing empty data batches

Dependency Updates#

  • Transformers: Pinned to 4.55.2 for stability and compatibility

  • vLLM: Updated to 0.15.1 with video pipeline compatibility fixes

  • FFmpeg: Upgraded to 8.0.1 for enhanced multimedia processing

  • Security Patches:

    • Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools

    • Removed vulnerable thirdparty aiohttp file from Ray

    • Updated to secure dependency versions

Bug Fixes#

  • Fixed fasttext predict call compatibility with numpy>2

  • Fixed broken NeMo Framework documentation links

  • Fixed ID generator blocking issues for large-scale processing

  • Fixed vLLM API compatibility with video captioning pipeline

  • Fixed Gliner tutorial examples and SDG workflow bugs

  • Improved semantic deduplication unit test reliability

Infrastructure & Developer Experience#

  • Secrets Detection: Automated secret scanning in CI/CD workflows

  • Dependabot Integration: Automatic dependency update pull requests

  • Enhanced Install Tests: Comprehensive installation validation across environments

  • AWS Runner Support: CI/CD execution on AWS infrastructure

  • Docker Optimization: Improved layer caching and build times with uv

  • Cursor Rules: Development guidelines and patterns for IDE assistance

Breaking Changes#

  • InternVideo2 Removed: Video pipelines must use alternative embedding models (Cosmos-Embed1)

Documentation Improvements#

  • Heuristic Filter Guide: Comprehensive documentation for language-specific filtering strategies

  • Distributed Classifier: Enhanced GPU memory optimization guidance with length-based sequence sorting

  • Installation Guide: Clearer instructions with troubleshooting for common issues

  • Memory Management: New guidance for handling CPU/GPU memory constraints

  • AWS Integration: Updated tutorials with correct AWS credentials setup


What’s Next#

Future releases will focus on:

  • Code Curation: Specialized pipelines for curating code datasets

  • Math Curation: Mathematical reasoning and problem-solving data curation

  • Generation Features: Completing the Ray refactor for synthetic data generation

  • PII Processing: Enhanced privacy-preserving data curation with Ray backend

  • Blending & Shuffling: Large-scale multi-source dataset blending and shuffling operations