NeMo Curator Release Notes: 26.04
Python 3.10 support will be removed in NeMo Curator 26.06. The 26.04 release is the last release to support Python 3.10. Plan to upgrade your environments to a newer supported Python version (3.11+) before installing 26.06. See Deprecations for details.
Added two new embedding backends for text curation, giving users flexibility to choose the best engine for their model size and throughput needs:
VLLMEmbeddingModelStage: A new standalone embedding stage powered by vLLM for high-throughput GPU-accelerated inference. Supports optional pretokenization (pretokenize=True) for best per-task throughput. Ideal for large embedding models where vLLM’s batching and memory management outperform Sentence Transformers.
SentenceTransformerEmbeddingModelStage: A new embedding stage using the sentence-transformers library directly, providing native support for models from the Sentence Transformers ecosystem.
EmbeddingCreatorStage enhancements: Added use_sentence_transformer flag (defaults to True) to select between Sentence Transformers’ SentenceTransformer and Hugging Face’s AutoModel classes. Added cache_dir parameter for controlling model download location.
For usage details, see Text Embeddings and vLLM Embedder.
Built-in LLM serving alongside curation pipelines using Ray Serve and vLLM:
InferenceServer and InferenceModelConfig: New APIs to deploy one or more LLMs as OpenAI-compatible endpoints directly within a Ray cluster, eliminating the need for separate inference infrastructure.
InferenceServer works as a context manager for automatic startup and cleanup of served models.
Pipeline.run() automatically detects when an InferenceServer is active and enforces RayDataExecutor usage for GPU pipeline stages to prevent resource conflicts.
GenerationConfig.extra_kwargs: New field for passing arbitrary parameters through to the OpenAI API create() call.
inference_server (Ray Serve + vLLM dependencies) and sdg_cuda12 (SDG with local inference support).
Learn more in the Inference Server documentation.
Moved RayDataExecutor out of the experimental namespace to nemo_curator.backends.ray_data. The executor is no longer marked as experimental, and the startup warning has been removed. Import path changes:
Before: from nemo_curator.backends.experimental.ray_data import RayDataExecutor
After: from nemo_curator.backends.ray_data import RayDataExecutor
Text classifiers that share the same base tokenizer can now reuse tokens across a pipeline, avoiding redundant tokenization. New parameters keep_tokens and use_existing_tokens on all distributed classifiers control this behavior. DeBERTa-based classifiers (DomainClassifier, MultilingualDomainClassifier, QualityClassifier, ContentTypeClassifier, FineWeb variants, PromptTaskComplexityClassifier) and LlamaGuard-based classifiers (AegisClassifier, InstructionDataGuardClassifier) each form a compatible tokenizer group.
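For code that has to run on both 26.04 and earlier releases, the RayDataExecutor import path change described above can be bridged with a fallback import. This is a minimal sketch: both module paths come from these notes, while the final None fallback is an assumption added here so the snippet also loads where NeMo Curator is not installed.

```python
# Prefer the stable 26.04 location, then fall back to the pre-26.04
# experimental namespace. Both module paths are taken from the release notes.
try:
    from nemo_curator.backends.ray_data import RayDataExecutor
except ImportError:
    try:
        from nemo_curator.backends.experimental.ray_data import RayDataExecutor
    except ImportError:
        # Assumption for this sketch only: tolerate a missing installation.
        RayDataExecutor = None

executor_available = RayDataExecutor is not None
```

Once every environment is on 26.04 or later, the fallback branches can be dropped in favor of the single new import.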
Switched the default embedding backend in TextSemanticDeduplicationWorkflow from SentenceTransformers to vLLM, with google/embeddinggemma-300m as the new default model:
TextSemanticDeduplicationWorkflow now uses VLLMEmbeddingModelStage for embedding generation, replacing EmbeddingCreatorStage (SentenceTransformers).
The default model changed from sentence-transformers/all-MiniLM-L6-v2 to google/embeddinggemma-300m.
embedding_pretokenize, embedding_vllm_init_kwargs, and model_cache_dir for vLLM configuration.
embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, and embedding_max_seq_length are no longer available.
Improved Prometheus and Grafana monitoring support for shared clusters:
/tmp/nemo_curator_metrics_{uid}), which prevents conflicts when multiple users share a Ray cluster.
metrics_dir parameter: New parameter on RayClient and start_prometheus_grafana.py for explicit control of where metrics data, PID files, and configuration are stored.
RayClient.stop() now removes Ray service discovery entries from the Prometheus configuration automatically.
Learn more in the Monitoring documentation.
Upgraded Cosmos-Xenna from 0.1.2 to 0.2.0 with a simplified resource model and improved GPU management:
Resources API: Removed nvdecs, nvencs, and entire_gpu fields. GPU allocation now uses gpu_memory_gb (fractional single-GPU) or gpus (one or more full GPUs) exclusively.
RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO environment variable to the Xenna executor to prevent Ray 2.54 from overriding accelerator environment variables when num_gpus=0.
Added nightly benchmarking coverage for FastText-based document filters:
fasttext_filter_raydata and fasttext_filter_xenna entries in the nightly benchmark suite, testing language identification and quality filtering pipelines.
fasttext_filter_benchmark.py follows the same Hydra-configured pattern as existing filter benchmarks, reporting num_kept_documents and throughput_docs_per_sec.
--fasttext-model-path to --fasttext-langid-model-path across benchmark scripts for clarity, and introduced --fasttext-quality-model-path for the new FastText quality filter model. The FastText language identification model uses the renamed --fasttext-langid-model-path argument.
Reorganized the DocumentFilter and DocumentModifier directory structures to avoid eagerly importing heavy dependencies:
DocumentFilter or DocumentModifier no longer pulls in heavyweight dependencies like HuggingFace Transformers, fastText, or histogram libraries.
heuristic/, token/, histogram/, and fasttext/ subdirectories.
string/, unicode/, and fasttext/ subdirectories.
ScoreFilter/Filter/Score moved: These stages moved from stages.text.modules to stages.text.filters.
Modify moved: Moved from stages.text.modules to stages.text.modifiers.
The data acquisition pipeline now uses a three-stage architecture instead of four, fusing the iterate and extract steps into a single DocumentIterateExtractStage. This reduces memory overhead and improves pipeline performance:
DocumentIterateExtractStage: Combines DocumentIterateStage and DocumentExtractStage into a single stage that iterates through downloaded files and extracts structured content in one pass.
Added max_calls_per_worker support to mitigate out-of-memory errors caused by lxml/libxml2 memory fragmentation in long-running jusText extraction jobs. CommonCrawlDownloadExtractStage now defaults to extractor_max_calls_per_worker=2 for JusTextExtractor, automatically recycling workers to reclaim fragmented memory.
Pipeline stages can now declare isolated Python dependencies using Ray’s native runtime_env support. Each stage can specify a different set of pip or uv packages, and Ray creates a cached virtualenv per unique dependency set so that incompatible library versions coexist in the same pipeline:
runtime_env class variable: New ClassVar[dict | None] on ProcessingStage for declaring per-stage Python packages (e.g., {"pip": ["transformers==4.40.0"]}).
with_() override: The with_() method now accepts a runtime_env parameter for per-instance overrides without modifying the stage class.
XennaExecutor, RayDataExecutor, and RayActorPoolExecutor all forward runtime_env to their respective Ray dispatch mechanisms.
Learn more in the Per-Stage Runtime Environments documentation.
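The split between a class-level runtime_env default and a per-instance override can be pictured with a small stand-in class. This is only a sketch: StageSketch and TokenizeStage are hypothetical and do not reproduce ProcessingStage, the body of with_() is an assumption about copy-on-override behavior, and the 4.41.0 pin is an arbitrary illustrative version.

```python
import copy
from typing import ClassVar, Optional


class StageSketch:
    """Hypothetical stand-in for a stage; not nemo_curator's ProcessingStage."""

    # Class-level default shared by every instance of the stage
    # (the notes describe the real annotation as ClassVar[dict | None]).
    runtime_env: ClassVar[Optional[dict]] = None

    def with_(self, runtime_env: Optional[dict] = None) -> "StageSketch":
        # Return a modified copy so the class-level default stays untouched.
        clone = copy.copy(self)
        if runtime_env is not None:
            clone.runtime_env = runtime_env
        return clone


class TokenizeStage(StageSketch):
    # Declared once on the class, using the example value from the notes.
    runtime_env = {"pip": ["transformers==4.40.0"]}


default_stage = TokenizeStage()
# Per-instance override with an illustrative, hypothetical version pin.
pinned_stage = default_stage.with_(runtime_env={"pip": ["transformers==4.41.0"]})
```

The override lives on the copied instance only, so other pipelines that use TokenizeStage keep the class default.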
Redesigned the audio task model and stage hierarchy for consistency with other modalities:
AudioBatch → AudioTask: Single-dict task model (Task[dict]) replacing the list-of-dicts batch model. One manifest entry per task, matching VideoTask and FileGroupTask conventions.
ProcessingStage subclassing: Removed the AudioTaskStage intermediate class. All audio stages now subclass ProcessingStage[AudioTask, AudioTask] directly.
backend=ray_data or backend=xenna).
cache_dir parameter: InferenceAsrNemoStage accepts an optional cache_dir for NeMo model checkpoint storage.
CreateInitialManifestFleursStage now declares ray_stage_spec() for correct Ray Data repartitioning.
Pipeline stages now track document-level metrics through StagePerfStats.num_items_processed, so you can see how each stage affects your dataset. After calling pipeline.run(), the returned task objects expose per-stage document counts that you can use to monitor filtering behavior and tune thresholds.
Standardized return type for all deduplication workflows:
WorkflowRunResult dataclass: All deduplication workflow run() methods now return WorkflowRunResult instead of None or dict.
WorkflowBase abstract class: New base class that most deduplication workflows inherit from, enforcing a consistent run() → WorkflowRunResult interface. (TextSemanticDeduplicationWorkflow implements the same interface without formal inheritance.)
total_time, identification_time, and others), duplicate counts (num_duplicates, num_duplicates_removed), and output paths through result.metadata.
TaskPerfUtils compatibility: collect_stage_metrics() and aggregate_task_metrics() now accept WorkflowRunResult directly.
Added MegatronTokenizerWriter, a writer stage that tokenizes text documents and produces the .bin and .idx files required by Megatron-LM for data loading during pretraining:
AutoTokenizer, replacing the need for Megatron’s standalone preprocess_data.py script.
tokenization_batch_size parameter controls how many documents are tokenized and written per batch, preventing out-of-memory errors on large datasets.
uint16) for vocabularies with 65,536 or fewer entries (such as GPT-2) and 4-byte tokens (int32) for larger vocabularies.
.bin and .idx files directly compatible with Megatron-LM’s MMapIndexedDataset data loader, including support for end-of-document token appending via append_eod.
Learn more in the Save and Export documentation.
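The vocabulary-size rule for token width can be illustrated with the standard library. This is a sketch of the rule as stated above, not of MegatronTokenizerWriter itself: the helper names are hypothetical, the token IDs are arbitrary, and the packed bytes do not reproduce the real .bin/.idx layout.

```python
import struct

# Token IDs 0..65535 fit in an unsigned 16-bit integer.
UINT16_MAX_VOCAB = 65_536


def token_format(vocab_size: int) -> str:
    """Return a struct code: 'H' (uint16, 2 bytes) or 'i' (int32, 4 bytes)."""
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    return "H" if vocab_size <= UINT16_MAX_VOCAB else "i"


def encode_tokens(token_ids: list[int], vocab_size: int) -> bytes:
    """Pack token IDs little-endian using the width chosen for the vocabulary."""
    fmt = token_format(vocab_size)
    return struct.pack(f"<{len(token_ids)}{fmt}", *token_ids)


# GPT-2 has 50,257 vocabulary entries, so 2 bytes per token suffice.
gpt2_bytes = encode_tokens([15496, 995, 50256], vocab_size=50_257)
# A hypothetical larger vocabulary needs 4 bytes per token.
large_bytes = encode_tokens([15496, 995, 128_000], vocab_size=128_256)
```

With three tokens each, the first buffer is 6 bytes and the second is 12, which is why the narrower type halves the on-disk size for small vocabularies.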
ImageReaderStage now works with RayDataExecutor in addition to XennaExecutor. The stage declares itself as a fanout stage, enabling Ray Data to repartition the multiple ImageBatch objects produced from each tar file across downstream workers for parallel processing.
Added tqdm progress bars to RayActorPoolExecutor for real-time visibility into task completion during stage processing and shuffle inserts. Progress bars are enabled by default and can be configured with show_progress and progress_interval parameters. This is particularly useful for long-running deduplication jobs where progress is not otherwise apparent.
New four-stage pipeline for curating audio language model training data from diarized audio segments:
ALMDataBuilderStage: Constructs fixed-duration training windows from consecutive audio segments, filtering by sample rate, bandwidth, and speaker count. Tracks loss statistics for pipeline diagnostics.
ALMDataOverlapStage: Removes overlapping windows based on a configurable overlap threshold, keeping windows closest to the target duration.
ALMManifestReader: Streams JSONL manifests line-by-line using fsspec, avoiding the memory overhead of loading entire files with Pandas.
ALMManifestWriterStage: Writes filtered results as JSONL with single-writer concurrency for safe output.
Learn more in the ALM Pipeline Concepts documentation and the ALM Tutorial.
Added stage-wise profiling for FLEURS GPU and ALM CPU pipelines with benchmark scripts integrated into the nightly CI matrix:
repeat_entries (44% of ALM wall time) as primary bottlenecks.
Added a suite of audio preprocessing and filtering stages that compose into end-to-end audio data curation pipelines:
MonoConversionStage, SegmentConcatenationStage, TimestampMapperStage: Foundational preprocessing stages that convert multi-channel audio to mono with strict or non-strict sample-rate enforcement, concatenate audio segments with configurable silence gaps, and map filtered segments back to original-file timestamps.
BandFilterStage: Classifies audio as full_band or narrow_band using a scikit-learn model on spectral features and filters out items not matching the configured band_value. Works standalone (loads audio from audio_filepath) or in-pipeline (accepts an upstream waveform).
SIGMOSFilterStage: Filters audio using SIGMOS quality metrics (NOISE, OVRL, SIG, COL, DISC, LOUD, REVERB on a 0–5 scale) via an ONNX model. Each threshold is independently configurable; setting any to None disables that dimension.
VADSegmentationStage: Segments audio into speech chunks using Silero VAD, fanning out to one AudioBatch per detected speech segment with start_ms, end_ms, segment_num, duration_sec, and both PyDub and torch waveform outputs. Configurable min_duration_sec, max_duration_sec, threshold, and speech_pad_ms. Runs on CPU or GPU.
SpeakerSeparationStage: Diarizes audio with NeMo’s SortFormer model, fanning out to one AudioBatch per detected speaker. Configurable exclude_overlaps, min_duration, gap_threshold, and buffer_time. GPU required.
UTMOSFilterStage: Filters audio segments based on UTMOS predicted Mean Opinion Score using the utmos22_strong model loaded via torch.hub from tarepan/SpeechMOS:v1.2.0. Auto-resamples to 16 kHz, accepts in-memory waveform + sample_rate or audio_filepath input, and uses a configurable mos_threshold (default 3.5; set to None to pass all).
Added InferenceSortformerStage for speaker diarization using NVIDIA’s Streaming Sortformer model. Supports configurable streaming latency, RTTM output, and parallel execution via Pipeline + RayActorPoolExecutor. Evaluated on CallHome-eng0 (139 files) at 6.2% macro DER and 6.0% weighted DER (collar=0.25s).
AudioDataFilterStage Composite Pipeline
Added AudioDataFilterStage, a CompositeStage that decomposes into a configurable sequence of audio processing sub-stages for extracting clean single-speaker segments from raw audio files. Each sub-stage has its own resource allocation (CPU for band/concat, GPU for VAD/UTMOS/SIGMOS/speaker separation), letting the executor parallelize across input files.
When all stages are enabled, the default order is: MonoConversion → VAD → BandFilter → UTMOS → SIGMOS → SegmentConcatenation → SpeakerSeparation → per-speaker filters (VAD + Band + UTMOS + SIGMOS) → TimestampMapper. All sub-stages are individually toggleable through AudioDataFilterConfig (enable_vad, enable_band_filter, enable_utmos, enable_sigmos, and so on). A subsequent fix in this release redesigns the stage to support all VAD/Speaker combinations and resolves a waveform tensor leak that grew memory under sustained load.
Added an end-to-end tutorial for processing the DNS Challenge ReadSpeech dataset (14,279 WAV files at 48 kHz, 19.3 hours) through AudioDataFilterStage:
CreateInitialManifestReadSpeechStage: Downloads and extracts the dataset, parses filenames for book and reader IDs, and emits AudioBatch tasks ready for the filter pipeline.
tutorials/audio/readspeech/: A Python pipeline (pipeline.py), Hydra YAML config (pipeline.yaml), Hydra runner (run.py), and an extract_segments.py post-processing utility (using soundfile, no ffmpeg required for wav/flac/ogg).
Integrated the NeMo Data Designer (NDD) client with NeMo Curator for declarative synthetic data generation at scale:
DataDesignerStage: New processing stage that wraps NDD’s DataDesigner.preview() to generate structured synthetic data within Curator pipelines. Accepts a DataDesignerConfigBuilder or YAML config file and supports sampler columns (Faker names, UUIDs, dates), expression columns (Jinja templates), and LLM text columns.
WikipediaParaphrasingStage, DiverseQAStage, DistillStage, ExtractKnowledgeStage, KnowledgeListStage) that route generation through NDD instead of AsyncOpenAIClient. Import from nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc.
DataDesignerStage automatically reports input_tokens_median_per_record and output_tokens_median_per_record from NDD’s analysis, enabling throughput tracking without manual instrumentation.
InferenceServer (Ray Serve + vLLM) via custom ModelProvider and remote endpoints (NVIDIA NIM).
Learn more in the NeMo Data Designer documentation.
Added S3 as an alternative transport for CommonCrawlWARCReader, which fetches individual WARC records using byte-range requests:
use_s3=True or the CC_USE_S3 environment variable. Addresses latency and throttling issues when fetching WARC records at scale over HTTPS.
CC_USE_S3, CC_S3_BUCKET, and CC_S3_KEY_PREFIX environment variables provide runtime configuration without code changes.
requests.Session and boto3 S3 client use double-checked locking for safe concurrent use with ThreadPoolExecutor.
Learn more in the Common Crawl documentation.
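Double-checked locking, mentioned above for the shared HTTP session and S3 client, follows a common pattern that can be sketched with the standard library. The LazyClientHolder class and its factory argument are hypothetical; a plain object stands in for an actual requests.Session or boto3 client, and this is not the reader's real implementation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LazyClientHolder:
    """Hypothetical holder; `factory` stands in for building a real client."""

    def __init__(self, factory):
        self._factory = factory
        self._client = None
        self._lock = threading.Lock()
        self.created = 0  # counts how often the factory actually ran

    def get(self):
        # First check without the lock: the fast path once initialized.
        if self._client is None:
            with self._lock:
                # Second check under the lock: another thread may have
                # finished initialization while this one was waiting.
                if self._client is None:
                    self._client = self._factory()
                    self.created += 1
        return self._client


holder = LazyClientHolder(factory=object)
with ThreadPoolExecutor(max_workers=8) as pool:
    clients = list(pool.map(lambda _: holder.get(), range(64)))
```

All 64 calls receive the same object and the factory runs once, while later calls skip the lock entirely.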
Added SlurmRayClient, a drop-in replacement for RayClient that orchestrates multi-node Ray clusters under SLURM job scheduling:
SLURM_NODEID=0) starts ray start --head, writes the GCS port to a shared broadcast file, and waits for workers to join. Worker nodes block on the port file and call ray start --block. Only the head returns from start().
RayClient() with SlurmRayClient(); no other changes required.
tutorials/slurm/: Ships submit_container.sh (NGC container via Pyxis) and submit.sh (bare-metal via uv) covering 1- and 2-node configurations with 2 or 8 GPUs per node. Set RAY_PORT_BROADCAST_DIR to a shared filesystem path when /tmp is node-local.
Verified end-to-end on 1- and 2-node H100 SLURM jobs using the nvcr.io/nvidia/nemo-curator:26.02 container.
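The head/worker handshake through a shared port file can be sketched as follows. This is an illustration of the idea rather than SlurmRayClient's code: the file name, helper names, timeout values, and the choice to treat a missing SLURM_NODEID as the head node are all assumptions, and no Ray process is started.

```python
import os
import tempfile
import time
from pathlib import Path

PORT_FILE = "ray_head_port"  # hypothetical file name for this sketch


def is_head_node(env=None) -> bool:
    """Node 0 acts as head; an unset SLURM_NODEID is assumed to mean head."""
    env = os.environ if env is None else env
    return env.get("SLURM_NODEID", "0") == "0"


def publish_port(broadcast_dir: str, port: int) -> Path:
    """Head side: write the port, then rename so readers never see a partial file."""
    target = Path(broadcast_dir) / PORT_FILE
    tmp = target.with_name(PORT_FILE + ".tmp")
    tmp.write_text(str(port))
    os.replace(tmp, target)
    return target


def wait_for_port(broadcast_dir: str, timeout_s: float = 5.0, poll_s: float = 0.05) -> int:
    """Worker side: poll the shared directory until the head publishes its port."""
    target = Path(broadcast_dir) / PORT_FILE
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if target.exists():
            return int(target.read_text())
        time.sleep(poll_s)
    raise TimeoutError(f"no port file appeared in {broadcast_dir}")


# A temporary directory stands in for the shared filesystem path.
with tempfile.TemporaryDirectory() as shared_dir:
    publish_port(shared_dir, 6379)
    discovered_port = wait_for_port(shared_dir)
```

The sketch also shows why the broadcast directory must be visible to every node: a node-local /tmp would leave workers polling a file that never appears.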
Added a four-stage Xenna pipeline that turns PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model:
PDFPartitioningStage: Reads a JSONL manifest and packs PDF entries into FileGroupTasks.
PDFPreprocessStage: Extracts PDF bytes from a directory, a CC-MAIN-2021-31-PDF-UNTRUNCATED zip hierarchy, or JSONL-encoded PDF datasets, and renders pages to images with scale-to-fit guarding against OOM on large pages.
NemotronParseInferenceStage: Runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with text_in_pic and enforce_eager flags and free-port retry logic on collisions.
NemotronParsePostprocessStage: Parses model output, aligns images and captions, crops, and emits interleaved rows.
NemotronParsePDFReader: Composite stage wrapping the four steps above so callers can drop the pipeline into a single pipeline.add_stage() call.
The render timeout uses a forked subprocess instead of signal.SIGALRM, allowing the stage to run inside Xenna actor processes (which dispatch on non-main threads). Adds pypdfium2 as a new dependency and ships benchmarking/scripts/nemotron_parse_pdf_benchmark.py.
Completed the interleaved IO round-trip with new readers, writers, and shared schema utilities so all four conversions (WDS tar ⇄ InterleavedBatch ⇄ Parquet) work out of the box:
InterleavedParquetReader / InterleavedParquetReaderStage: Reads Parquet directly into InterleavedBatch, with fields= passthrough column selection (consistent with the WDS reader), push-down column projection via pq.read_schema(), and max_batch_bytes splitting that preserves per-split source-file lineage.
InterleavedWebdatasetWriterStage: Writes InterleavedBatch to MINT-1T-style WDS tar shards. Uses urllib.parse.quote(sample_id, safe="") for injective, roundtrip-safe key escaping and groups rows by sample_id in a single groupby pass. Supported modalities are metadata, text, and image; any other raises ValueError at write time.
utils/schema.py): New reconcile_schema(), align_table(), and resolve_schema() helpers shared by all arrow-based readers and writers. Reserved columns get canonical types and use safe=False for safe large↔small casts; passthrough columns preserve types and use safe=True to surface (rather than silently corrupt) overflow.
schema= and schema_overrides= parameters with a warning when both are provided, a writer on_materialize_error= policy ("error", "warn", "drop_row", "drop_sample"), and mixed-backend safe path handling so batches that combine S3 and local paths no longer fail silently.
Benchmarks on 80 NVMe shards (MINT-1T PDF data) with an aspect-ratio filter applied show Parquet-sourced paths ~5× faster than WDS-sourced paths.
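The key-escaping call named above, urllib.parse.quote(sample_id, safe=""), can be checked in isolation. The sketch below shows why it is injective and reversible; the escape_key helper and the sample IDs are made up for illustration.

```python
from urllib.parse import quote, unquote


def escape_key(sample_id: str) -> str:
    """Percent-encode every reserved character, including '/'."""
    return quote(sample_id, safe="")


# Made-up IDs, including one that already looks percent-encoded.
sample_ids = ["s3://bucket/shard-0/doc 1.pdf", "a/b", "a%2Fb", "plain_id"]
escaped = [escape_key(s) for s in sample_ids]
roundtripped = [unquote(e) for e in escaped]
```

Because '%' itself is encoded as %25, the IDs "a/b" and "a%2Fb" map to different keys ("a%2Fb" and "a%252Fb"), and no escaped key contains a path separator that could be mistaken for tar directory structure.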
Added four filter stages for cleaning interleaved image/text datasets:
InterleavedBlurFilterStage: Computes Laplacian variance on each image with OpenCV and filters out images below a configurable sharpness threshold.
InterleavedQRCodeFilterStage: Detects QR codes with OpenCV and filters images whose QR-code bounding box exceeds a configurable area ratio of the total image.
InterleavedCLIPScoreFilterStage: Computes CLIP image and text embeddings and filters samples whose cosine similarity falls below a minimum semantic-alignment threshold.
InterleavedImageToTextRatioFilterStage: Computes the per-sample ratio of image count to text word count and filters samples outside a configurable min_ratio/max_ratio window.
Added the Nemotron Nano 12B V2 VLM as a captioning backend in the video pipeline alongside the existing Qwen-VL backend. Select with --captioning-algorithm nemotron --model-dir <path> in the video_split_clip_example.py tutorial. Three precision variants ship together: nemotron / nemotron-bf16 (default BF16, auto-downloaded), nemotron-fp8 (FP8-quantized for lower memory), and nemotron-nvfp4 (NVFP4 quantization-aware-distilled checkpoint).
The exact deduplication identification stage now supports batched insertion into the shuffler, improving throughput by processing multiple file groups in a single call. This reduces actor call overhead and enables larger, more efficient GPU operations during the shuffle phase.
identification_batchsize parameter on ExactDeduplicationWorkflow: Controls how many input blocks are concatenated and inserted together. For example, an input_blocksize of 256MiB with identification_batchsize=4 processes ~1 GB of data per insertion call.
read_and_insert_batch when a stage provides it, falling back to single-task processing otherwise.
FuzzyDeduplicationWorkflow now exposes three parameters for controlling GPU memory and shuffle behavior during the LSH stage:
lsh_num_output_partitions: Sets the total number of output partitions for the LSH shuffle. When None (default), the partition count is chosen automatically.
lsh_rmm_pool_size: Controls the RMM GPU memory pool size in bytes. Defaults to "auto" (90% of free GPU memory).
lsh_spill_memory_limit: Controls the device memory limit for spilling to host. Defaults to "auto" (80% of the RMM pool size). Set to None to disable spilling.
These parameters were previously hardcoded in the LSH stage and are now configurable at the workflow level, enabling finer-grained GPU memory tuning for large-scale fuzzy deduplication jobs.
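To get a feel for what the "auto" settings resolve to, the percentages above can be applied to a given amount of free GPU memory. The helpers below are hypothetical and only restate the documented arithmetic; they do not query a GPU, and the rounding behavior is an assumption.

```python
GIB = 1024**3


def resolve_rmm_pool_size(setting, free_gpu_bytes: int) -> int:
    """'auto' means 90% of free GPU memory; otherwise an explicit byte count."""
    if setting == "auto":
        return free_gpu_bytes * 90 // 100
    return int(setting)


def resolve_spill_limit(setting, pool_bytes: int):
    """'auto' means 80% of the pool; None disables spilling to host."""
    if setting is None:
        return None
    if setting == "auto":
        return pool_bytes * 80 // 100
    return int(setting)


# Hypothetical GPU with 80 GiB free at startup.
pool = resolve_rmm_pool_size("auto", free_gpu_bytes=80 * GIB)
spill = resolve_spill_limit("auto", pool_bytes=pool)
no_spill = resolve_spill_limit(None, pool_bytes=pool)
```

On that hypothetical GPU the defaults come out to a 72 GiB pool with spilling starting at roughly 57.6 GiB, which gives a baseline for choosing explicit byte values.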
Exposed additional configuration parameters in ExactDeduplicationWorkflow for fine-grained control over large-scale cluster runs:
total_nparts: Set the total number of output partitions explicitly instead of relying on the automatic default (one-third of input task count).
rmm_pool_size: Configure the RMM GPU memory pool size in bytes. Supports "auto" (90% of free GPU memory), a specific byte value, or None (50% of free GPU memory with dynamic expansion).
spill_memory_limit: Set the device memory threshold for spilling to host in bytes. Supports "auto" (80% of the RMM pool size), a specific byte value, or None (spilling disabled).
These parameters are useful when the defaults are not optimal for your hardware or dataset size:
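A minimal sketch of how such overrides might be assembled is shown here. The parameter names come from these notes, but the numeric values are arbitrary, the helper functions are hypothetical, and passing the dictionary as keyword arguments to ExactDeduplicationWorkflow (alongside its other required arguments, omitted here) is an assumption rather than a confirmed call signature.

```python
GIB = 1024**3
MIB = 1024**2


def default_total_nparts(num_input_tasks: int) -> int:
    """Documented automatic default: one-third of the input task count.
    Flooring and the minimum of 1 are assumptions of this sketch."""
    return max(1, num_input_tasks // 3)


def bytes_per_insertion(input_blocksize_bytes: int, identification_batchsize: int) -> int:
    """Data volume handed to the shuffler in one batched insertion call."""
    return input_blocksize_bytes * identification_batchsize


# Illustrative values for a hypothetical 300-task run on 80 GiB GPUs.
exact_dedup_overrides = {
    "total_nparts": 2 * default_total_nparts(num_input_tasks=300),
    "rmm_pool_size": 60 * GIB,       # explicit byte value instead of "auto"
    "spill_memory_limit": 48 * GIB,  # start spilling to host at 48 GiB
    "identification_batchsize": 4,   # four input blocks per insertion call
}

per_call = bytes_per_insertion(256 * MIB, exact_dedup_overrides["identification_batchsize"])
```

With a 256 MiB block size and a batch size of 4, each insertion call covers 1 GiB of input, matching the example given for identification_batchsize.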
Confirmed full GPU utilization for the AEGIS safety classifier when running on multi-GPU setups. The AEGIS classifier, which uses the LlamaGuard-7b generative model, properly distributes inference across all available GPUs. Added a performance note to the classifier documentation to set expectations for processing times relative to encoder-based classifiers.
Resolved four HIGH-severity vulnerabilities affecting Curator dependencies:
torch.load() and pickle.load() without weights_only=True when loading model checkpoints, enabling remote code execution through maliciously crafted .nemo or .ckpt files. Curator’s InferenceAsrNemoStage calls ASRModel.from_pretrained(), which uses this deserialization path. The CVE was fixed in nemo-toolkit 2.6.1; bumped to >=2.7.2 to pick up additional fixes and ensure compatibility with Curator’s dependency set.
xgrammar==0.1.29 pin to >=0.1.32.
ray_dist.jar in the Ray Python package, bypassed the maxNumberLength constraint, allowing denial of service through arbitrarily long JSON numbers. Since Curator does not use Ray’s Java support, the JAR is now deleted during the Docker image build with a build-time verification guard. This fix applies only to the container image.
cuda12 optional dependency group for Xenna GPU detection.
text_cpu optional dependency group.
nemo_toolkit[asr] from ==2.4.0 to >=2.7.2 to address deserialization CVEs. Only affects audio_cpu and audio_cuda12 extras.
constraint-dependencies (>=0.1.21) to override-dependencies (>=0.1.32) to override vLLM’s pinned version and address CVE-2026-25048.
boto3>=1.35 to the math_cpu install extra for S3 range requests in CommonCrawlWARCReader.
Fixed out-of-memory errors during long-running jusText extraction jobs caused by lxml/libxml2 C-heap memory fragmentation. Worker recycling through max_calls_per_worker now prevents unbounded RSS growth by restarting worker processes periodically.
Fixed a RuntimeError: Cannot re-initialize CUDA in forked subprocess error that occurred when running vLLM stages with RayDataExecutor. The vLLM auto-detection for spawn versus fork multiprocessing only triggers inside Ray actors, not Ray tasks. The RayDataExecutor.execute_setup_on_node method dispatches setup_on_node as a remote task, so vLLM previously defaulted to fork and caused a CUDA reinitialization error. Fixed by setting VLLM_WORKER_MULTIPROC_METHOD=spawn in the remote task.
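The same setting can be applied by hand in custom code that initializes vLLM outside a Ray actor. The sketch below only sets the environment variable named above; the helper name is hypothetical, and the variable must be in place before vLLM creates its worker processes.

```python
import os


def force_vllm_spawn(env=None):
    """Make vLLM start workers with 'spawn' so they do not inherit a CUDA context."""
    env = os.environ if env is None else env
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    return env


# Call this at the top of the task body, before importing or constructing vLLM.
force_vllm_spawn()
```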
Fixed a race condition in video captioning stages (CaptionGenerationStage and CaptionEnhancementStage) where multiple workers simultaneously initializing vLLM caused a FileNotFoundError from the shared torch.compile cache directory. vLLM initialization now runs inside setup_on_node so the cache directory is created once per node, matching the pattern that text vLLM stages already use.
Fixed audio pipeline stage names not propagating in StagePerfStats, making benchmark output unable to identify per-stage timing. All audio stages (GetAudioDurationStage, PreserveByValueStage, AudioToDocumentStage, GetPairwiseWerStage) now correctly report their names, and stage performance history persists when stages create new task objects.
Fixed a deepcopy/pickle crash in MathContentExtractor caused by unpickleable threading.Lock and magic.Magic objects. Added __getstate__/__setstate__ methods that strip these objects before serialization and reinitialize them on deserialization. This fixes failures triggered by ProcessingStage.with_() and Ray executors.
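The general pattern behind this fix can be sketched with a stand-in class. ExtractorSketch is hypothetical and is not the real MathContentExtractor; it holds only a threading.Lock, which is enough to show how __getstate__ and __setstate__ make an object survive pickling and deep copying.

```python
import copy
import pickle
import threading


class ExtractorSketch:
    """Hypothetical stand-in holding one unpicklable member."""

    def __init__(self, name: str):
        self.name = name
        self._lock = threading.Lock()  # cannot be pickled or deep-copied

    def __getstate__(self):
        # Drop the lock before serialization; keep everything else.
        state = self.__dict__.copy()
        state.pop("_lock", None)
        return state

    def __setstate__(self, state):
        # Restore plain attributes, then build a fresh lock for this copy.
        self.__dict__.update(state)
        self._lock = threading.Lock()


original = ExtractorSketch("math")
restored = pickle.loads(pickle.dumps(original))
cloned = copy.deepcopy(original)
```

Without the two methods, both pickle.dumps and copy.deepcopy raise a TypeError on the lock. With them, each copy gets its own independent lock.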
Added __getstate__/__setstate__ to CommonCrawlWARCReader for pickle compatibility with Ray executors. The threading.Lock, requests.Session, and boto3 S3 client are stripped during serialization and lazily reinitialized after deserialization.
The image reader silently produced under-packed tars when --images-per-tar was greater than --batch-size, because each batch only emits a single chunk regardless of the tar size flag. The pipeline now warns when --images-per-tar exceeds --batch-size, and tutorials have been updated to clarify the relationship between the two flags. Addresses NVBug 6075086.
VideoReader now validates input paths during initialization, accepting single-file inputs in addition to directories and raising explicit errors for unsupported file formats instead of failing later in the pipeline.
NeMo Curator 26.04 is the last release that supports Python 3.10. Beginning with 26.06, Python 3.10 will no longer be a supported runtime, and install extras will target newer Python versions (3.11+).
If you are pinning a Python version in CI, Dockerfiles, or uv projects, update those pins now so you are ready for the 26.06 release.
TextSemanticDeduplicationWorkflow embedding backend: The default embedding backend changed from SentenceTransformers to vLLM. The default model changed from sentence-transformers/all-MiniLM-L6-v2 to google/embeddinggemma-300m. The parameters embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, and embedding_max_seq_length have been removed. Use embedding_vllm_init_kwargs to pass configuration to the vLLM backend instead.
Resources API: The nvdecs, nvencs, and entire_gpu fields have been removed from Resources. Stages that previously used entire_gpu=True should use gpus=1 instead. Stages that used nvdecs or nvencs should use gpus for GPU allocation.
AudioBatch removed: Replaced by AudioTask. Update imports from from nemo_curator.tasks import AudioBatch to from nemo_curator.tasks import AudioTask. Data is now a single dict instead of dict | list[dict].
RayDataExecutor import path: Moved from nemo_curator.backends.experimental.ray_data to nemo_curator.backends.ray_data. Update imports accordingly.
DocumentExtractStage removed: The standalone DocumentExtractStage class has been removed. Use DocumentIterateExtractStage with an optional extractor parameter instead. The DocumentExtractor abstract base class is unchanged.
DocumentIterateStage renamed: DocumentIterateStage has been replaced by DocumentIterateExtractStage. Update imports from nemo_curator.stages.text.download.base.iterator.
ExactDeduplicationWorkflow.run() and FuzzyDeduplicationWorkflow.run() now return WorkflowRunResult instead of None.
SemanticDeduplicationWorkflow.run() and TextSemanticDeduplicationWorkflow.run() now return WorkflowRunResult instead of dict.
TextDuplicatesRemovalWorkflow.run() now returns WorkflowRunResult instead of list[FileGroupTask] | None.
dict return from semantic workflows must migrate to result.metadata and result.pipeline_tasks.