---
description: >-
  Release notes and version history for NeMo Curator platform updates and new features
categories:
  - reference
tags:
  - release-notes
  - changelog
  - updates
personas:
  - data-scientist-focused
  - mle-focused
  - admin-focused
  - devops-focused
difficulty: reference
content_type: reference
modality: universal
---

# NeMo Curator Release Notes: 26.02

## What's New in 26.02

### Benchmarking Infrastructure

New comprehensive benchmarking framework for performance monitoring and optimization:

* **End-to-End Pipeline Benchmarking**: Automated benchmarks for all curation modalities (text, image, video, audio)
* **Performance Tracking**: Integration with MLflow for metrics tracking and Slack for notifications
* **Nightly Benchmarks**: Continuous performance monitoring across:
  * Text pipelines: exact deduplication, fuzzy deduplication, semantic deduplication, score filters, modifiers
  * Image curation workflows with DALI-based processing
  * Video processing pipelines with scene detection and semantic deduplication
  * Audio ASR inference and quality assessment
* **Grafana Dashboards**: Real-time monitoring of pipeline performance and resource utilization

### Ray Actor Pool Executor Improvements

Enhanced features for the experimental Ray Actor Pool execution backend:

* **Progress Bars**: New visual feedback for long-running actor pool operations, making it easier to monitor pipeline execution
* **Improved Load Balancing**: Better worker distribution and task scheduling
* **Enhanced Stability**: Continued refinements to the experimental executor

Learn more in the [Execution Backends documentation](/reference/infra/execution-backends).

### YAML Configuration Support

Declarative pipeline configuration for text curation workflows:

* **YAML-Based Pipelines**: Define entire curation pipelines in YAML configuration files
* **Pre-Built Configurations**: Ready-to-use configs for common workflows:
  * Code filtering, exact/fuzzy/semantic deduplication
  * Heuristic filtering (English and non-English)
  * FastText language identification
* **Reproducible Workflows**: Version-controlled pipeline definitions for consistent results

Example:

```bash
python -m nemo_curator.config.run --config_file heuristic_filter_english_pipeline.yaml
```

### Workflow Results API

New API for tracking and analyzing pipeline execution:

* **Performance Metrics**: Automatic tracking of processing time, throughput, and resource usage
* **Better Debugging**: Detailed logs and error reporting for failed stages

## Improvements from 25.09

### Video Curation

* **Model Updates**: Removed InternVideo2 dependency; updated to more performant alternatives
* **vLLM 0.14.1**: Upgraded for better video captioning compatibility and performance
* **FFmpeg 8.0.1**: Latest FFmpeg with improved codec support and performance
* **Enhanced Tutorials**: Improved video processing examples with real-world scenarios

### Audio Curation

* **Enhanced Documentation**: Comprehensive ASR inference and quality assessment guides
* **Improved WER Filtering**: Better guidance for Word Error Rate filtering thresholds
* **Manifest Handling**: More robust JSONL manifest processing for large audio datasets

### Image Curation

* **Optimized Batch Sizes**: Reduced default batch sizes for better CPU memory usage (batch\_size=50, num\_threads=4)
* **Memory Guidance**: Added troubleshooting documentation for out-of-memory errors
* **Tutorial Improvements**: Updated examples optimized for typical GPU configurations

### Text Curation

* **Better Memory Management**: Improved handling of large-scale semantic deduplication
* **Small Cluster Warnings**: Automatic warnings when n\_clusters is too small for effective deduplication
* **FilePartitioning Improvements**: One worker per partition for better parallelization

### Deduplication Enhancements

* **Cloud Storage Support**: Fixed ParquetReader/Writer and pairwise I/O for S3, GCS, and Azure Blob
* **Non-Blocking ID Generation**: Improved ID generator performance for large datasets
* **Empty Batch Handling**: Better error handling for filters processing empty data batches

## Dependency Updates

* **Transformers**: Pinned to 4.55.2 for stability and compatibility
* **vLLM**: Updated to 0.14.1 with video pipeline compatibility fixes
* **FFmpeg**: Upgraded to 8.0.1 for enhanced multimedia processing
* **Security Patches**:
  * Addressed CVEs in aiohttp, urllib3, python-multipart, setuptools
  * Removed vulnerable thirdparty aiohttp file from Ray
  * Updated to secure dependency versions

## Bug Fixes

* Fixed fasttext predict call compatibility with numpy>2
* Fixed broken NeMo Framework documentation links
* Fixed MegatronTokenizerWriter to download only necessary tokenizer files
* Fixed ID generator blocking issues for large-scale processing
* Fixed vLLM API compatibility with video captioning pipeline
* Fixed Gliner tutorial examples and SDG workflow bugs
* Improved semantic deduplication unit test reliability

## Infrastructure & Developer Experience

* **Secrets Detection**: Automated secret scanning in CI/CD workflows
* **Dependabot Integration**: Automatic dependency update pull requests
* **Enhanced Install Tests**: Comprehensive installation validation across environments
* **AWS Runner Support**: CI/CD execution on AWS infrastructure
* **Docker Optimization**: Improved layer caching and build times with uv
* **Code Linting**: Standardized code quality checks with markdownlint and pre-commit hooks
* **Cursor Rules**: Development guidelines and patterns for IDE assistance

## Breaking Changes

* **InternVideo2 Removed**: Video pipelines must use alternative embedding models (Cosmos-Embed1)

## Documentation Improvements

* **Heuristic Filter Guide**: Comprehensive documentation for language-specific filtering strategies
* **Distributed Classifier**: Enhanced GPU memory optimization guidance with length-based sequence sorting
* **Installation Guide**: Clearer instructions with troubleshooting for common issues
* **Memory Management**: New guidance for handling CPU/GPU memory constraints
* **AWS Integration**: Updated tutorials with correct AWS credentials setup

***

## What's Next

Future releases will focus on:

* **Code Curation**: Specialized pipelines for curating code datasets
* **Math Curation**: Mathematical reasoning and problem-solving data curation
* **Generation Features**: Completing the Ray refactor for synthetic data generation
* **PII Processing**: Enhanced privacy-preserving data curation with Ray backend
* **Blending & Shuffling**: Large-scale multi-source dataset blending and shuffling operations