About NeMo CuratorRelease Notes

NeMo Curator Release Notes: 26.02

View as Markdown

What’s New in 26.02

Benchmarking Infrastructure

New comprehensive benchmarking framework for performance monitoring and optimization:

  • End-to-End Pipeline Benchmarking: Automated benchmarks for all curation modalities (text, image, video, audio)
  • Performance Tracking: Integration with MLflow for metrics tracking and Slack for notifications
  • Nightly Benchmarks: Continuous performance monitoring across:
    • Text pipelines: exact deduplication, fuzzy deduplication, semantic deduplication, score filters, modifiers
    • Image curation workflows with DALI-based processing
    • Video processing pipelines with scene detection and semantic deduplication
    • Audio ASR inference and quality assessment
  • Grafana Dashboards: Real-time monitoring of pipeline performance and resource utilization

Ray Actor Pool Executor Improvements

Enhanced features for the experimental Ray Actor Pool execution backend:

  • Progress Bars: New visual feedback for long-running actor pool operations, making it easier to monitor pipeline execution
  • Improved Load Balancing: Better worker distribution and task scheduling
  • Enhanced Stability: Continued refinements to the experimental executor

Learn more in the Execution Backends documentation.

YAML Configuration Support

Declarative pipeline configuration for text curation workflows:

  • YAML-Based Pipelines: Define entire curation pipelines in YAML configuration files
  • Pre-Built Configurations: Ready-to-use configs for common workflows:
    • Code filtering, exact/fuzzy/semantic deduplication
    • Heuristic filtering (English and non-English)
    • FastText language identification
  • Reproducible Workflows: Version-controlled pipeline definitions for consistent results

Example:

$python -m nemo_curator.config.run --config_file heuristic_filter_english_pipeline.yaml

Workflow Results API

New API for tracking and analyzing pipeline execution:

  • Performance Metrics: Automatic tracking of processing time, throughput, and resource usage
  • Better Debugging: Detailed logs and error reporting for failed stages

Improvements from 25.09

Video Curation

  • Model Updates: Removed InternVideo2 dependency; updated to more performant alternatives
  • vLLM 0.14.1: Upgraded for better video captioning compatibility and performance
  • FFmpeg 8.0.1: Latest FFmpeg with improved codec support and performance
  • Enhanced Tutorials: Improved video processing examples with real-world scenarios

Audio Curation

  • Enhanced Documentation: Comprehensive ASR inference and quality assessment guides
  • Improved WER Filtering: Better guidance for Word Error Rate filtering thresholds
  • Manifest Handling: More robust JSONL manifest processing for large audio datasets

Image Curation

  • Optimized Batch Sizes: Reduced default batch sizes for better CPU memory usage (batch_size=50, num_threads=4)
  • Memory Guidance: Added troubleshooting documentation for out-of-memory errors
  • Tutorial Improvements: Updated examples optimized for typical GPU configurations

Text Curation

  • Better Memory Management: Improved handling of large-scale semantic deduplication
  • Small Cluster Warnings: Automatic warnings when n_clusters is too small for effective deduplication
  • FilePartitioning Improvements: One worker per partition for better parallelization

Deduplication Enhancements

  • Cloud Storage Support: Fixed ParquetReader/Writer and pairwise I/O for S3, GCS, and Azure Blob
  • Non-Blocking ID Generation: Improved ID generator performance for large datasets
  • Empty Batch Handling: Better error handling for filters processing empty data batches

Dependency Updates

  • Transformers: Pinned to 4.55.2 for stability and compatibility
  • vLLM: Updated to 0.14.1 with video pipeline compatibility fixes
  • FFmpeg: Upgraded to 8.0.1 for enhanced multimedia processing
  • Security Patches:
    • Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools
    • Removed vulnerable thirdparty aiohttp file from Ray
    • Updated to secure dependency versions

Bug Fixes

  • Fixed fasttext predict call compatibility with numpy>2
  • Fixed broken NeMo Framework documentation links
  • Fixed MegatronTokenizerWriter to download only necessary tokenizer files
  • Fixed ID generator blocking issues for large-scale processing
  • Fixed vLLM API compatibility with video captioning pipeline
  • Fixed Gliner tutorial examples and SDG workflow bugs
  • Improved semantic deduplication unit test reliability

Infrastructure & Developer Experience

  • Secrets Detection: Automated secret scanning in CI/CD workflows
  • Dependabot Integration: Automatic dependency update pull requests
  • Enhanced Install Tests: Comprehensive installation validation across environments
  • AWS Runner Support: CI/CD execution on AWS infrastructure
  • Docker Optimization: Improved layer caching and build times with uv
  • Code Linting: Standardized code quality checks with markdownlint and pre-commit hooks
  • Cursor Rules: Development guidelines and patterns for IDE assistance

Breaking Changes

  • InternVideo2 Removed: Video pipelines must use alternative embedding models (Cosmos-Embed1)

Documentation Improvements

  • Heuristic Filter Guide: Comprehensive documentation for language-specific filtering strategies
  • Distributed Classifier: Enhanced GPU memory optimization guidance with length-based sequence sorting
  • Installation Guide: Clearer instructions with troubleshooting for common issues
  • Memory Management: New guidance for handling CPU/GPU memory constraints
  • AWS Integration: Updated tutorials with correct AWS credentials setup

What’s Next

Future releases will focus on:

  • Code Curation: Specialized pipelines for curating code datasets
  • Math Curation: Mathematical reasoning and problem-solving data curation
  • Generation Features: Completing the Ray refactor for synthetic data generation
  • PII Processing: Enhanced privacy-preserving data curation with Ray backend
  • Blending & Shuffling: Large-scale multi-source dataset blending and shuffling operations