NeMo Curator Release Notes: 26.04
Python 3.10 support will be removed in NeMo Curator 26.06. The 26.04 release is the last release to support Python 3.10. Plan to upgrade your environments to a newer supported Python version (3.11+) before installing 26.06. See Deprecations for details.
What’s New in 26.04
vLLM and Sentence Transformers Embedding Support
Added two new embedding backends for text curation, giving users flexibility to choose the best engine for their model size and throughput needs:
- VLLMEmbeddingModelStage: A new standalone embedding stage powered by vLLM for high-throughput GPU-accelerated inference. Supports optional pretokenization (pretokenize=True) for best per-task throughput. Ideal for large embedding models where vLLM's batching and memory management outperform Sentence Transformers.
- SentenceTransformerEmbeddingModelStage: A new embedding stage using the sentence-transformers library directly, providing native support for models from the Sentence Transformers ecosystem.
- EmbeddingCreatorStage enhancements: Added use_sentence_transformer flag (defaults to True) to select between Sentence Transformers' SentenceTransformer and Hugging Face's AutoModel classes. Added cache_dir parameter for controlling model download location.
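A minimal sketch of selecting the vLLM backend; the import paths and the model_identifier parameter name are assumptions, so verify exact signatures against the linked documentation.

```python
from nemo_curator.pipeline import Pipeline
# Import path is an assumption; see the Text Embeddings docs for the real one.
from nemo_curator.stages.text.embedders import VLLMEmbeddingModelStage

pipeline = Pipeline(name="embed")
pipeline.add_stage(
    VLLMEmbeddingModelStage(
        model_identifier="google/embeddinggemma-300m",  # parameter name is an assumption
        pretokenize=True,  # optional pretokenization for best per-task throughput
    )
)
```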
For usage details, see Text Embeddings and vLLM Embedder.
Inference Server (Ray Serve)
Built-in LLM serving alongside curation pipelines using Ray Serve and vLLM:
- InferenceServer and InferenceModelConfig: New APIs to deploy one or more LLMs as OpenAI-compatible endpoints directly within a Ray cluster, eliminating the need for separate inference infrastructure.
- Context manager support: InferenceServer works as a context manager for automatic startup and cleanup of served models.
- GPU contention detection: Pipeline.run() automatically detects when an InferenceServer is active and enforces RayDataExecutor usage for GPU pipeline stages to prevent resource conflicts.
- GenerationConfig.extra_kwargs: New field for passing arbitrary parameters through to the OpenAI API create() call.
- New install extras: inference_server (Ray Serve + vLLM dependencies) and sdg_cuda12 (SDG with local inference support).
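A sketch of the context-manager pattern described above; the import paths, the InferenceModelConfig field names, and the model choice are assumptions.

```python
# Import paths and constructor fields are assumptions; see the Inference
# Server docs for the exact API.
from nemo_curator.core.inference import InferenceServer, InferenceModelConfig
from nemo_curator.pipeline import Pipeline

pipeline = Pipeline(name="sdg")  # ...add stages that call the served endpoint...

config = InferenceModelConfig(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model
with InferenceServer([config]):
    # The model is served as an OpenAI-compatible endpoint inside the Ray
    # cluster; Pipeline.run() detects the active server and enforces
    # RayDataExecutor for GPU stages.
    pipeline.run()
# Exiting the block shuts the served models down automatically.
```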
Learn more in the Inference Server documentation.
RayDataExecutor Promoted from Experimental
Moved RayDataExecutor out of the experimental namespace to nemo_curator.backends.ray_data. The executor is no longer marked as experimental, and the startup warning has been removed. Import path changes:
- Before: from nemo_curator.backends.experimental.ray_data import RayDataExecutor
- After: from nemo_curator.backends.ray_data import RayDataExecutor
Shared Tokenizer Support for Multiple Classifiers
Text classifiers that share the same base tokenizer can now reuse tokens across a pipeline, avoiding redundant tokenization. New parameters keep_tokens and use_existing_tokens on all distributed classifiers control this behavior. DeBERTa-based classifiers (DomainClassifier, MultilingualDomainClassifier, QualityClassifier, ContentTypeClassifier, FineWeb variants, PromptTaskComplexityClassifier) and LlamaGuard-based classifiers (AegisClassifier, InstructionDataGuardClassifier) each form a compatible tokenizer group.
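A sketch of token reuse across two classifiers in the same tokenizer group; the classifier import path is an assumption.

```python
from nemo_curator.pipeline import Pipeline
# Import path is an assumption; both classifiers are DeBERTa-based and
# therefore share a tokenizer group.
from nemo_curator.stages.text.classifiers import DomainClassifier, QualityClassifier

pipeline = Pipeline(name="classify")
pipeline.add_stage(DomainClassifier(keep_tokens=True))           # tokenize once, keep tokens on the task
pipeline.add_stage(QualityClassifier(use_existing_tokens=True))  # reuse those tokens, skip retokenization
```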
vLLM Default for Semantic Deduplication Embeddings
Switched the default embedding backend in TextSemanticDeduplicationWorkflow from SentenceTransformers to vLLM, with google/embeddinggemma-300m as the new default model:
- vLLM embedding backend: TextSemanticDeduplicationWorkflow now uses VLLMEmbeddingModelStage for embedding generation, replacing EmbeddingCreatorStage (SentenceTransformers).
- New default model: Changed from sentence-transformers/all-MiniLM-L6-v2 to google/embeddinggemma-300m.
- New parameters: Added embedding_pretokenize, embedding_vllm_init_kwargs, and model_cache_dir for vLLM configuration.
- Removed parameters: The SentenceTransformers-specific parameters embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, and embedding_max_seq_length are no longer available.
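A sketch of the new vLLM-backed configuration; the workflow import path and the input/output parameter names are assumptions.

```python
# Import path and the input/output parameter names are assumptions.
from nemo_curator.stages.deduplication.semantic import TextSemanticDeduplicationWorkflow

workflow = TextSemanticDeduplicationWorkflow(
    input_path="/data/jsonl",                          # hypothetical paths
    output_path="/data/dedup_output",
    embedding_pretokenize=True,                        # new in 26.04
    embedding_vllm_init_kwargs={"dtype": "bfloat16"},  # forwarded to vLLM engine init
    model_cache_dir="/models/cache",                   # controls model download location
)
result = workflow.run()
```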
Multi-User Metrics Isolation
Improved Prometheus and Grafana monitoring support for shared clusters:
- Per-user metrics directories: The default metrics path now includes the user ID (/tmp/nemo_curator_metrics_{uid}), which prevents conflicts when multiple users share a Ray cluster.
- metrics_dir parameter: New parameter on RayClient and start_prometheus_grafana.py for explicit control of where metrics data, PID files, and configuration are stored.
- PID-file-based process tracking: NeMo Curator tracks Prometheus and Grafana instances through PID files instead of process-name scanning, which enables multiple isolated monitoring instances per node.
- Automatic Ray dashboard generation: The Grafana setup now auto-generates Ray default, data, serve, and serve-deployment dashboards from the built-in Ray dashboard factory.
- Graceful cleanup on shutdown: RayClient.stop() now removes Ray service discovery entries from the Prometheus configuration automatically.
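A sketch of pinning the metrics directory explicitly; the RayClient import path is an assumption.

```python
# Import path is an assumption.
from nemo_curator.core.client import RayClient

# Explicit per-user metrics directory instead of the
# /tmp/nemo_curator_metrics_{uid} default.
client = RayClient(metrics_dir="/shared/metrics/alice")
client.start()
try:
    ...  # run pipelines
finally:
    client.stop()  # also removes Ray service discovery entries from Prometheus
```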
Learn more in the Monitoring documentation.
Cosmos-Xenna 0.2.0
Upgraded Cosmos-Xenna from 0.1.2 to 0.2.0 with a simplified resource model and improved GPU management:
- Simplified Resources API: Removed nvdecs, nvencs, and entire_gpu fields. GPU allocation now uses gpu_memory_gb (fractional single-GPU) or gpus (one or more full GPUs) exclusively.
- Xenna-managed CUDA devices: Xenna now manages CUDA device visibility directly, replacing the previous Ray-managed approach.
- Ray 2.54: Updated the minimum Ray dependency from 2.50 to 2.54 for compatibility with Cosmos-Xenna 0.2.0. Added the RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO environment variable to the Xenna executor to prevent Ray 2.54 from overriding accelerator environment variables when num_gpus=0.
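A sketch of the simplified resource model; the Resources import path is an assumption.

```python
# Import path is an assumption.
from nemo_curator.stages.resources import Resources

# Fractional use of a single GPU, expressed in gigabytes of GPU memory:
light = Resources(gpu_memory_gb=8)

# One or more whole GPUs (replaces the removed entire_gpu=True):
heavy = Resources(gpus=2)

# Resources(nvdecs=...), Resources(nvencs=...), and Resources(entire_gpu=True)
# are no longer valid with Cosmos-Xenna 0.2.0.
```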
FastText Filter Benchmarking
Added nightly benchmarking coverage for FastText-based document filters:
- FastText filter benchmarks: New fasttext_filter_raydata and fasttext_filter_xenna entries in the nightly benchmark suite, testing language identification and quality filtering pipelines.
- Dedicated benchmark script: fasttext_filter_benchmark.py follows the same Hydra-configured pattern as existing filter benchmarks, reporting num_kept_documents and throughput_docs_per_sec.
- Consistent model path naming: Renamed --fasttext-model-path to --fasttext-langid-model-path across benchmark scripts for clarity, and introduced --fasttext-quality-model-path for the new FastText quality filter model.
Filter and Modifier Directory Reorganization
Reorganized the DocumentFilter and DocumentModifier directory structures to avoid eagerly importing heavy dependencies:
- Lazy imports: Importing DocumentFilter or DocumentModifier no longer pulls in heavyweight dependencies like HuggingFace Transformers, fastText, or histogram libraries.
- Grouped by dependency weight: Filters are now organized into heuristic/, token/, histogram/, and fasttext/ subdirectories.
- Modifiers reorganized: Modifiers are now grouped into string/, unicode/, and fasttext/ subdirectories.
- ScoreFilter/Filter/Score moved: These stages moved from stages.text.modules to stages.text.filters.
- Modify moved: Moved from stages.text.modules to stages.text.modifiers.
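The post-reorganization import paths look roughly like the sketch below; the exact module names are inferred from the subdirectory layout above, and the heuristic filter class is a hypothetical example.

```python
# Module paths are inferred from the new directory layout; verify against
# the API reference. WordCountFilter is a hypothetical example name.
from nemo_curator.stages.text.filters import ScoreFilter       # moved from stages.text.modules
from nemo_curator.stages.text.modifiers import Modify          # moved from stages.text.modules
from nemo_curator.stages.text.filters.heuristic import WordCountFilter  # heavy deps stay lazy
```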
Fused Document Iterate and Extract Stages
The data acquisition pipeline now uses a three-stage architecture instead of four, fusing the iterate and extract steps into a single DocumentIterateExtractStage. This reduces memory overhead and improves pipeline performance:
- Fused DocumentIterateExtractStage: Combines DocumentIterateStage and DocumentExtractStage into a single stage that iterates through downloaded files and extracts structured content in one pass.
- Improved Memory Efficiency: The fused stage processes records inline instead of materializing intermediate DataFrames, reducing peak memory usage. With limited RAM (200 GB), the Common Crawl pipeline succeeds with 32 CPUs where the unfused pipeline ran out of memory even at 16 CPUs.
- Better Performance: Benchmarks show faster runtimes across both the Ray Data and Xenna executors (e.g., ~6% faster with Ray Data, ~18% faster with Xenna).
Worker Recycling for JusText Extraction
Added max_calls_per_worker support to mitigate out-of-memory errors caused by lxml/libxml2 memory fragmentation in long-running jusText extraction jobs. CommonCrawlDownloadExtractStage now defaults to extractor_max_calls_per_worker=2 for JusTextExtractor, automatically recycling workers to reclaim fragmented memory.
Per-Stage Runtime Environments
Pipeline stages can now declare isolated Python dependencies using Ray’s native runtime_env support. Each stage can specify a different set of pip or uv packages, and Ray creates a cached virtualenv per unique dependency set so that incompatible library versions coexist in the same pipeline:
- runtime_env class variable: New ClassVar[dict | None] on ProcessingStage for declaring per-stage Python packages (e.g., {"pip": ["transformers==4.40.0"]}).
- with_() override: The with_() method now accepts a runtime_env parameter for per-instance overrides without modifying the stage class.
- All backends supported: XennaExecutor, RayDataExecutor, and RayActorPoolExecutor all forward runtime_env to their respective Ray dispatch mechanisms.
- Additive isolation: Base-environment packages (NeMo Curator, loguru, etc.) remain importable in isolated workers unless explicitly overridden.
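A sketch of both declaration styles; the ProcessingStage base-class import path is an assumption, and a real stage would also implement the required processing methods.

```python
from typing import ClassVar
# Base-class import path is an assumption.
from nemo_curator.stages.base import ProcessingStage

class PinnedTransformersStage(ProcessingStage):
    # Ray builds (and caches) a virtualenv with exactly this package set
    # for workers of this stage; other stages keep the base environment.
    runtime_env: ClassVar[dict | None] = {"pip": ["transformers==4.40.0"]}

# Per-instance override via with_(), without touching the class:
stage = PinnedTransformersStage().with_(runtime_env={"pip": ["transformers==4.45.0"]})
```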
Learn more in the Per-Stage Runtime Environments documentation.
Audio Task Redesign
Redesigned the audio task model and stage hierarchy for consistency with other modalities:
- AudioBatch → AudioTask: Single-dict task model (Task[dict]) replacing the list-of-dicts batch model. One manifest entry per task, matching VideoTask and FileGroupTask conventions.
- Direct ProcessingStage subclassing: Removed the AudioTaskStage intermediate class. All audio stages now subclass ProcessingStage[AudioTask, AudioTask] directly.
- Configurable backend selection: FLEURS and ALM tutorial runners support backend selection using Hydra config (backend=ray_data or backend=xenna).
- cache_dir parameter: InferenceAsrNemoStage accepts an optional cache_dir for NeMo model checkpoint storage.
- Fan-out stage fix: CreateInitialManifestFleursStage now declares ray_stage_spec() for correct Ray Data repartitioning.
Pipeline Stage Metrics
Pipeline stages now track document-level metrics through StagePerfStats.num_items_processed, so you can see how each stage affects your dataset. After calling pipeline.run(), the returned task objects expose per-stage document counts that you can use to monitor filtering behavior and tune thresholds.
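A sketch of reading the new counters; the per-task attribute name used below is an assumption based on the description above.

```python
tasks = pipeline.run()

# Each returned task carries StagePerfStats entries; the attribute name
# (_stage_perf) and field names are assumptions, so check the API reference.
for task in tasks:
    for perf in task._stage_perf:
        print(perf.stage_name, perf.num_items_processed)
```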
Workflow Results API
Standardized return type for all deduplication workflows:
- WorkflowRunResult dataclass: All deduplication workflow run() methods now return WorkflowRunResult instead of None or dict.
- WorkflowBase abstract class: New base class that most deduplication workflows inherit from, enforcing a consistent run() → WorkflowRunResult interface. (TextSemanticDeduplicationWorkflow implements the same interface without formal inheritance.)
- Structured metadata: Access per-stage timing (total_time, identification_time, and others), duplicate counts (num_duplicates, num_duplicates_removed), and output paths through result.metadata.
- TaskPerfUtils compatibility: collect_stage_metrics() and aggregate_task_metrics() now accept WorkflowRunResult directly.
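A sketch of consuming the new return type; the utility import path is an assumption, and the metadata keys shown are the ones named above.

```python
result = workflow.run()  # any deduplication workflow

# Structured metadata replaces the old None/dict returns.
print(result.metadata["total_time"])
print(result.metadata["num_duplicates_removed"])

# Utility import path is an assumption.
from nemo_curator.utils.performance_utils import collect_stage_metrics
stage_metrics = collect_stage_metrics(result)  # accepts WorkflowRunResult directly
```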
Megatron Tokenization Writer
Added MegatronTokenizerWriter, a writer stage that tokenizes text documents and produces the .bin and .idx files required by Megatron-LM for data loading during pretraining:
- Integrated tokenization: Tokenize and export in a single pipeline step using any Hugging Face AutoTokenizer, replacing the need for Megatron's standalone preprocess_data.py script.
- Memory-efficient batching: The tokenization_batch_size parameter controls how many documents are tokenized and written per batch, preventing out-of-memory errors on large datasets.
- Automatic dtype selection: Uses 2-byte tokens (uint16) for vocabularies with 65,536 or fewer entries (such as GPT-2) and 4-byte tokens (int32) for larger vocabularies.
- Megatron-compatible output: Produces .bin and .idx files directly compatible with Megatron-LM's MMapIndexedDataset data loader, including support for end-of-document token appending via append_eod.
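A sketch of the writer stage; the import path and the output-path and tokenizer parameter names are assumptions, while tokenization_batch_size and append_eod come from the notes above.

```python
# Import path and the path/model_identifier parameter names are assumptions.
from nemo_curator.stages.text.io.writer import MegatronTokenizerWriter

pipeline.add_stage(
    MegatronTokenizerWriter(
        path="/data/megatron_out",     # destination for the .bin/.idx pair
        model_identifier="gpt2",       # any Hugging Face AutoTokenizer name
        tokenization_batch_size=2048,  # documents tokenized and written per batch
        append_eod=True,               # append the end-of-document token
    )
)
```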
Learn more in the Save and Export documentation.
Image Reader Ray Data Support
ImageReaderStage now works with RayDataExecutor in addition to XennaExecutor. The stage declares itself as a fanout stage, enabling Ray Data to repartition the multiple ImageBatch objects produced from each tar file across downstream workers for parallel processing.
Actor Pool Progress Bars
Added tqdm progress bars to RayActorPoolExecutor for real-time visibility into task completion during stage processing and shuffle inserts. Progress bars are enabled by default and can be configured with show_progress and progress_interval parameters. This is particularly useful for long-running deduplication jobs where progress is not otherwise apparent.
ALM Data Curation Pipeline
New four-stage pipeline for curating audio language model training data from diarized audio segments:
- ALMDataBuilderStage: Constructs fixed-duration training windows from consecutive audio segments, filtering by sample rate, bandwidth, and speaker count. Tracks loss statistics for pipeline diagnostics.
- ALMDataOverlapStage: Removes overlapping windows based on a configurable overlap threshold, keeping windows closest to the target duration.
- ALMManifestReader: Streams JSONL manifests line-by-line using fsspec, avoiding the memory overhead of loading entire files with Pandas.
- ALMManifestWriterStage: Writes filtered results as JSONL with single-writer concurrency for safe output.
- Hydra configuration: YAML-driven pipeline runner with command-line parameter overrides and backend selection (Xenna or Ray Data).
Learn more in the ALM Pipeline Concepts documentation and the ALM Tutorial.
Audio Stage-Wise Profiling
Added stage-wise profiling for FLEURS GPU and ALM CPU pipelines with benchmark scripts integrated into the nightly CI matrix:
- Per-stage timing breakdowns on DGX A100: FLEURS pipeline completes in approximately 100 seconds with Xenna (4.01 tasks per second) and 123 seconds with Ray Data (3.27 tasks per second).
- Bottleneck analysis: Identified data download (91% of FLEURS wall time) and repeat_entries (44% of ALM wall time) as primary bottlenecks.
- Nightly CI entries: Split FLEURS and ALM benchmarks into separate Xenna and Ray Data entries for both-backend coverage.
Audio Filtering Pipeline Stages
Added a suite of audio preprocessing and filtering stages that compose into end-to-end audio data curation pipelines:
- MonoConversionStage, SegmentConcatenationStage, TimestampMapperStage: Foundational preprocessing that converts multi-channel audio to mono with strict or non-strict sample-rate enforcement, concatenates audio segments with configurable silence gaps, and maps filtered segments back to original-file timestamps.
- BandFilterStage: Classifies audio as full_band or narrow_band using a scikit-learn model on spectral features and filters out items not matching the configured band_value. Works standalone (loads audio from audio_filepath) or in-pipeline (accepts an upstream waveform).
- SIGMOSFilterStage: Filters audio using SIGMOS quality metrics (NOISE, OVRL, SIG, COL, DISC, LOUD, REVERB on a 0–5 scale) via an ONNX model. Each threshold is independently configurable; setting any to None disables that dimension.
- VADSegmentationStage: Segments audio into speech chunks using Silero VAD, fanning out to one AudioBatch per detected speech segment with start_ms, end_ms, segment_num, duration_sec, and both PyDub and torch waveform outputs. Configurable min_duration_sec, max_duration_sec, threshold, and speech_pad_ms. Runs on CPU or GPU.
- SpeakerSeparationStage: Diarizes audio with NeMo's SortFormer model, fanning out to one AudioBatch per detected speaker. Configurable exclude_overlaps, min_duration, gap_threshold, and buffer_time. GPU required.
- UTMOSFilterStage: Filters audio segments based on UTMOS predicted Mean Opinion Score using the utmos22_strong model loaded via torch.hub from tarepan/SpeechMOS:v1.2.0. Auto-resamples to 16 kHz, accepts in-memory waveform + sample_rate or audio_filepath input, and uses a configurable mos_threshold (default 3.5; set to None to pass all).
Streaming Sortformer Speaker Diarization
Added InferenceSortformerStage for speaker diarization using NVIDIA’s Streaming Sortformer model. Supports configurable streaming latency, RTTM output, and parallel execution via Pipeline + RayActorPoolExecutor. Evaluated on CallHome-eng0 (139 files) at 6.2% macro DER and 6.0% weighted DER (collar=0.25s).
AudioDataFilterStage Composite Pipeline
Added AudioDataFilterStage, a CompositeStage that decomposes into a configurable sequence of audio processing sub-stages for extracting clean single-speaker segments from raw audio files. Each sub-stage has its own resource allocation (CPU for band/concat, GPU for VAD/UTMOS/SIGMOS/speaker separation), letting the executor parallelize across input files.
When all stages are enabled, the default order is: MonoConversion → VAD → BandFilter → UTMOS → SIGMOS → SegmentConcatenation → SpeakerSeparation → per-speaker filters (VAD + Band + UTMOS + SIGMOS) → TimestampMapper. All sub-stages are individually toggleable through AudioDataFilterConfig (enable_vad, enable_band_filter, enable_utmos, enable_sigmos, and so on). A subsequent fix in this release redesigns the stage to support all VAD/Speaker combinations and resolves a waveform tensor leak that grew memory under sustained load.
DNS Challenge ReadSpeech Tutorial
Added an end-to-end tutorial for processing the DNS Challenge ReadSpeech dataset (14,279 WAV files at 48 kHz, 19.3 hours) through AudioDataFilterStage:
- CreateInitialManifestReadSpeechStage: Downloads and extracts the dataset, parses filenames for book and reader IDs, and emits AudioBatch tasks ready for the filter pipeline.
- Tutorial materials in tutorials/audio/readspeech/: A Python pipeline (pipeline.py), Hydra YAML config (pipeline.yaml), Hydra runner (run.py), and an extract_segments.py post-processing utility (using soundfile; no ffmpeg required for wav/flac/ogg).
- Benchmark and test coverage: A ReadSpeech audio curation benchmark is added to the nightly suite, and the default GPU resource requirements are lowered so the tutorial fits on smaller hosts. The dataset stage ships with a README and tests.
NeMo Data Designer Integration
Integrated the NeMo Data Designer (NDD) client with NeMo Curator for declarative synthetic data generation at scale:
- DataDesignerStage: New processing stage that wraps NDD's DataDesigner.preview() to generate structured synthetic data within Curator pipelines. Accepts a DataDesignerConfigBuilder or YAML config file and supports sampler columns (Faker names, UUIDs, dates), expression columns (Jinja templates), and LLM text columns.
- NDD-backed Nemotron-CC stages: Drop-in replacements for the five Nemotron-CC stages (WikipediaParaphrasingStage, DiverseQAStage, DistillStage, ExtractKnowledgeStage, KnowledgeListStage) that route generation through NDD instead of AsyncOpenAIClient. Import from nemo_curator.stages.synthetic.nemotron_cc.nemo_data_designer.nemotron_cc.
- Token metric collection: DataDesignerStage automatically reports input_tokens_median_per_record and output_tokens_median_per_record from NDD's analysis, enabling throughput tracking without manual instrumentation.
- Local and remote inference: Supports both the built-in InferenceServer (Ray Serve + vLLM) via custom ModelProvider and remote endpoints (NVIDIA NIM).
Learn more in the NeMo Data Designer documentation.
S3 Transport for CommonCrawlWARCReader
Added S3 as an alternative transport for CommonCrawlWARCReader, which fetches individual WARC records using byte-range requests:
- S3 transport via boto3: Activate with use_s3=True or the CC_USE_S3 environment variable. Addresses latency and throttling issues when fetching WARC records at scale over HTTPS.
- Environment variable configuration: CC_USE_S3, CC_S3_BUCKET, and CC_S3_KEY_PREFIX environment variables provide runtime configuration without code changes.
- Thread-safe lazy initialization: Both the requests.Session and boto3 S3 client use double-checked locking for safe concurrent use with ThreadPoolExecutor.
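A sketch of both activation styles; the reader import path and the environment-variable values are assumptions (a real reader likely takes additional arguments).

```python
import os
# Import path is an assumption.
from nemo_curator.stages.text.download.common_crawl import CommonCrawlWARCReader

# Option 1: explicit flag.
reader = CommonCrawlWARCReader(use_s3=True)

# Option 2: environment-driven, no code changes (values are examples).
os.environ["CC_USE_S3"] = "1"
os.environ["CC_S3_BUCKET"] = "commoncrawl"
os.environ["CC_S3_KEY_PREFIX"] = "crawl-data"
```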
Learn more in the Common Crawl documentation.
Multi-Node Ray on SLURM
Added SlurmRayClient, a drop-in replacement for RayClient that orchestrates multi-node Ray clusters under SLURM job scheduling:
- Head/worker role detection: The head node (SLURM_NODEID=0) starts ray start --head, writes the GCS port to a shared broadcast file, and waits for workers to join. Worker nodes block on the port file and call ray start --block. Only the head returns from start().
- Drop-in replacement: Existing pipelines move from local to SLURM by replacing RayClient() with SlurmRayClient(); no other changes required.
- Reference tutorials in tutorials/slurm/: Ships submit_container.sh (NGC container via Pyxis) and submit.sh (bare-metal via uv) covering 1- and 2-node configurations with 2 or 8 GPUs per node. Set RAY_PORT_BROADCAST_DIR to a shared filesystem path when /tmp is node-local.
Verified end-to-end on 1- and 2-node H100 SLURM jobs using the nvcr.io/nvidia/nemo-curator:26.02 container.
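A sketch of the drop-in swap; the import path is an assumption.

```python
# Import path is an assumption; SlurmRayClient is a drop-in replacement
# for RayClient.
from nemo_curator.core.client import SlurmRayClient

client = SlurmRayClient()  # head/worker role derived from SLURM_NODEID
client.start()             # only the head node returns from start()
try:
    pipeline.run()         # runs on the multi-node Ray cluster
finally:
    client.stop()
```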
Nemotron-Parse PDF Pipeline
Added a four-stage Xenna pipeline that turns PDF datasets into interleaved Parquet output using NVIDIA’s Nemotron-Parse vision-language model:
- PDFPartitioningStage: Reads a JSONL manifest and packs PDF entries into FileGroupTasks.
- PDFPreprocessStage: Extracts PDF bytes from a directory, a CC-MAIN-2021-31-PDF-UNTRUNCATED zip hierarchy, or JSONL-encoded PDF datasets, and renders pages to images with scale-to-fit guarding against OOM on large pages.
- NemotronParseInferenceStage: Runs Nemotron-Parse via vLLM (recommended) or Hugging Face Transformers, with text_in_pic and enforce_eager flags and free-port retry logic on collisions.
- NemotronParsePostprocessStage: Parses model output, aligns images and captions, crops, and emits interleaved rows.
- NemotronParsePDFReader: Composite stage wrapping the four steps above so callers can drop the pipeline into a single pipeline.add_stage() call.
The render timeout uses a forked subprocess instead of signal.SIGALRM, allowing the stage to run inside Xenna actor processes (which dispatch on non-main threads). Adds pypdfium2 as a new dependency and ships benchmarking/scripts/nemotron_parse_pdf_benchmark.py.
Interleaved IO Round-Trip
Completed the interleaved IO round-trip with new readers, writers, and shared schema utilities so all four conversions (WDS tar ⇄ InterleavedBatch ⇄ Parquet) work out of the box:
- InterleavedParquetReader / InterleavedParquetReaderStage: Reads Parquet directly into InterleavedBatch, with fields= passthrough column selection (consistent with the WDS reader), push-down column projection via pq.read_schema(), and max_batch_bytes splitting that preserves per-split source-file lineage.
- InterleavedWebdatasetWriterStage: Writes InterleavedBatch to MINT-1T-style WDS tar shards. Uses urllib.parse.quote(sample_id, safe="") for injective, roundtrip-safe key escaping and groups rows by sample_id in a single groupby pass. Supported modalities are metadata, text, and image; any other raises ValueError at write time.
- Schema utilities (utils/schema.py): New reconcile_schema(), align_table(), and resolve_schema() helpers shared by all arrow-based readers and writers. Reserved columns get canonical types and use safe=False for safe large↔small casts; passthrough columns preserve types and use safe=True to surface (rather than silently corrupt) overflow.
- Reader/writer base improvements: New schema= and schema_overrides= parameters with a warning when both are provided, a writer on_materialize_error= policy ("error", "warn", "drop_row", "drop_sample"), and mixed-backend safe path handling so batches that combine S3 and local paths no longer fail silently.
Benchmarks on 80 NVMe shards (MINT-1T PDF data) with an aspect-ratio filter applied show Parquet-sourced paths ~5× faster than WDS-sourced paths.
Interleaved Dataset Filters
Added four filter stages for cleaning interleaved image/text datasets:
- InterleavedBlurFilterStage: Computes Laplacian variance on each image with OpenCV and filters out images below a configurable sharpness threshold.
- InterleavedQRCodeFilterStage: Detects QR codes with OpenCV and filters images whose QR-code bounding box exceeds a configurable area ratio of the total image.
- InterleavedCLIPScoreFilterStage: Computes CLIP image and text embeddings and filters samples whose cosine similarity falls below a minimum semantic-alignment threshold.
- InterleavedImageToTextRatioFilterStage: Computes the per-sample ratio of image count to text word count and filters samples outside a configurable min_ratio/max_ratio window.
Nemotron VLM Video Captioning
Added the Nemotron Nano 12B V2 VLM as a captioning backend in the video pipeline alongside the existing Qwen-VL backend. Select with --captioning-algorithm nemotron --model-dir <path> in the video_split_clip_example.py tutorial. Three precision variants ship together: nemotron / nemotron-bf16 (default BF16, auto-downloaded), nemotron-fp8 (FP8-quantized for lower memory), and nemotron-nvfp4 (NVFP4 quantization-aware-distilled checkpoint).
Improvements
Batched Shuffle Insertion for Exact Deduplication
The exact deduplication identification stage now supports batched insertion into the shuffler, improving throughput by processing multiple file groups in a single call. This reduces actor call overhead and enables larger, more efficient GPU operations during the shuffle phase.
- New identification_batchsize parameter on ExactDeduplicationWorkflow: Controls how many input blocks are concatenated and inserted together. For example, an input_blocksize of 256 MiB with identification_batchsize=4 processes ~1 GB of data per insertion call.
- Batch-aware shuffle adapter: The Ray actor pool shuffle adapter now automatically uses read_and_insert_batch when a stage provides it, falling back to single-task processing otherwise.
LSH Memory Configuration for Fuzzy Deduplication
FuzzyDeduplicationWorkflow now exposes three parameters for controlling GPU memory and shuffle behavior during the LSH stage:
- lsh_num_output_partitions: Sets the total number of output partitions for the LSH shuffle. When None (default), the partition count is chosen automatically.
- lsh_rmm_pool_size: Controls the RMM GPU memory pool size in bytes. Defaults to "auto" (90% of free GPU memory).
- lsh_spill_memory_limit: Controls the device memory limit for spilling to host. Defaults to "auto" (80% of the RMM pool size). Set to None to disable spilling.
These parameters were previously hardcoded in the LSH stage and are now configurable at the workflow level, enabling finer-grained GPU memory tuning for large-scale fuzzy deduplication jobs.
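A minimal sketch; the workflow import path and the input/output parameter names are assumptions.

```python
# Import path and the input/output parameter names are assumptions.
from nemo_curator.stages.deduplication.fuzzy import FuzzyDeduplicationWorkflow

workflow = FuzzyDeduplicationWorkflow(
    input_path="/data/jsonl",
    output_path="/data/fuzzy_out",
    lsh_num_output_partitions=None,  # None: partition count chosen automatically
    lsh_rmm_pool_size="auto",        # "auto": 90% of free GPU memory
    lsh_spill_memory_limit="auto",   # "auto": 80% of the RMM pool size
)
result = workflow.run()
```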
Exact Deduplication Workflow Tuning Parameters
Exposed additional configuration parameters in ExactDeduplicationWorkflow for fine-grained control over large-scale cluster runs:
- total_nparts: Set the total number of output partitions explicitly instead of relying on the automatic default (one-third of input task count).
- rmm_pool_size: Configure the RMM GPU memory pool size in bytes. Supports "auto" (90% of free GPU memory), a specific byte value, or None (50% of free GPU memory with dynamic expansion).
- spill_memory_limit: Set the device memory threshold for spilling to host in bytes. Supports "auto" (80% of the RMM pool size), a specific byte value, or None (spilling disabled).
These parameters are useful when the defaults are not optimal for your hardware or dataset size, as in the sketch below.
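A combined sketch, also showing identification_batchsize from the batched-insertion improvement above; the import path and the input/output parameter names are assumptions.

```python
# Import path and the input/output parameter names are assumptions.
from nemo_curator.stages.deduplication.exact import ExactDeduplicationWorkflow

workflow = ExactDeduplicationWorkflow(
    input_path="/data/jsonl",
    output_path="/data/exact_out",
    identification_batchsize=4,  # ~1 GB per insertion call at 256 MiB blocks
    total_nparts=512,            # explicit output partition count
    rmm_pool_size="auto",        # 90% of free GPU memory
    spill_memory_limit="auto",   # 80% of the RMM pool size
)
result = workflow.run()
```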
AEGIS Classifier GPU Utilization
Confirmed full GPU utilization for the AEGIS safety classifier when running on multi-GPU setups. The AEGIS classifier, which uses the LlamaGuard-7b generative model, properly distributes inference across all available GPUs. Added a performance note to the classifier documentation to set expectations for processing times relative to encoder-based classifiers.
Security Fixes
CVE Fixes for Audio and Inference Dependencies
Resolved four HIGH-severity vulnerabilities affecting Curator dependencies:
- nemo-toolkit RCE (CVE-2025-33245, CVE-2025-33253): NeMo Toolkit versions before 2.6.1 used torch.load() and pickle.load() without weights_only=True when loading model checkpoints, enabling remote code execution through maliciously crafted .nemo or .ckpt files. Curator's InferenceAsrNemoStage calls ASRModel.from_pretrained(), which uses this deserialization path. The CVE was fixed in nemo-toolkit 2.6.1; bumped to >=2.7.2 to pick up additional fixes and ensure compatibility with Curator's dependency set.
- xgrammar DoS (CVE-2026-25048): Constructing a grammar rule with deeply nested parentheses triggered a segfault via uncontrolled recursion in xgrammar's syntax parsing, which could crash applications using vLLM structured output without authentication. Fixed by overriding vLLM's xgrammar==0.1.29 pin to >=0.1.32.
- jackson-core DoS (GHSA-72hv-8253-57qq): The non-blocking JSON parser in jackson-core 2.16.1, bundled inside ray_dist.jar in the Ray Python package, bypassed the maxNumberLength constraint, allowing denial of service through arbitrarily long JSON numbers. Since Curator does not use Ray's Java support, the JAR is now deleted during the Docker image build with a build-time verification guard. This fix applies only to the container image.
Dependency Updates
- Cosmos-Xenna: Updated from 0.1.2 to 0.2.0 with simplified resource model
- Ray: Updated minimum version from 2.50 to 2.54. This supersedes the previous constraint dependency for CVE GHSA-q279-jhrf-cc6v, which required >=2.52.
- pynvml: Added to the cuda12 optional dependency group for Xenna GPU detection
- sentence-transformers: Added to the text_cpu optional dependency group
- vllm: New vllm optional dependency group
- uv: Added minimum required version (>=0.7.0) to prevent lockfile revision drift
- nemo-toolkit: Bumped nemo_toolkit[asr] from ==2.4.0 to >=2.7.2 to address deserialization CVEs. Only affects the audio_cpu and audio_cuda12 extras.
- xgrammar: Moved from constraint-dependencies (>=0.1.21) to override-dependencies (>=0.1.32) to override vLLM's pinned version and address CVE-2026-25048.
- boto3: Added boto3>=1.35 to the math_cpu install extra for S3 range requests in CommonCrawlWARCReader
Bug Fixes
JusText Extraction OOM
Fixed out-of-memory errors during long-running jusText extraction jobs caused by lxml/libxml2 C-heap memory fragmentation. Worker recycling through max_calls_per_worker now prevents unbounded RSS growth by restarting worker processes periodically.
CUDA Fork Error with vLLM and RayDataExecutor
Fixed a RuntimeError: Cannot re-initialize CUDA in forked subprocess error that occurred when running vLLM stages with RayDataExecutor. The vLLM auto-detection for spawn versus fork multiprocessing only triggers inside Ray actors, not Ray tasks. The RayDataExecutor.execute_setup_on_node method dispatches setup_on_node as a remote task, so vLLM previously defaulted to fork and caused a CUDA reinitialization error. Fixed by setting VLLM_WORKER_MULTIPROC_METHOD=spawn in the remote task.
Video vLLM Setup Race Condition
Fixed a race condition in video captioning stages (CaptionGenerationStage and CaptionEnhancementStage) where multiple workers simultaneously initializing vLLM caused a FileNotFoundError from the shared torch.compile cache directory. vLLM initialization now runs inside setup_on_node so the cache directory is created once per node, matching the pattern that text vLLM stages already use.
Audio Stage Name Propagation
Fixed audio pipeline stage names not propagating in StagePerfStats, making benchmark output unable to identify per-stage timing. All audio stages (GetAudioDurationStage, PreserveByValueStage, AudioToDocumentStage, GetPairwiseWerStage) now correctly report their names, and stage performance history persists when stages create new task objects.
MathContentExtractor Serialization Crash
Fixed a deepcopy/pickle crash in MathContentExtractor caused by unpickleable threading.Lock and magic.Magic objects. Added __getstate__/__setstate__ methods that strip these objects before serialization and reinitialize them on deserialization. This fixes failures triggered by ProcessingStage.with_() and Ray executors.
CommonCrawlWARCReader Serialization for Ray
Added __getstate__/__setstate__ to CommonCrawlWARCReader for pickle compatibility with Ray executors. The threading.Lock, requests.Session, and boto3 S3 client are stripped during serialization and lazily reinitialized after deserialization.
Images-Per-Tar Warning When Greater Than Batch Size
The image reader silently produced under-packed tars when --images-per-tar was greater than --batch-size, because each batch only emits a single chunk regardless of the tar size flag. The pipeline now warns when --images-per-tar exceeds --batch-size, and tutorials have been updated to clarify the relationship between the two flags. Addresses NVBug 6075086.
VideoReader Path Validation
VideoReader now validates input paths during initialization, accepting single-file inputs in addition to directories and raising explicit errors for unsupported file formats instead of failing later in the pipeline.
Deprecations
Python 3.10 Support Will Be Removed in 26.06
NeMo Curator 26.04 is the last release that supports Python 3.10. Beginning with 26.06, Python 3.10 will no longer be a supported runtime, and install extras will target newer Python versions (3.11+).
- Action required: Upgrade your environments to a newer supported Python version (3.11+) before installing 26.06.
- Affected surfaces: PyPI install extras, the NeMo Curator container, and all Python APIs.
- Why: Python 3.10 reaches end-of-life in October 2026, and several upstream dependencies (notably in the GPU and inference stacks) are dropping 3.10 support.
If you are pinning a Python version in CI, Dockerfiles, or uv projects, update those pins now so you are ready for the 26.06 release.
Breaking Changes
- Minimum Ray version: The minimum required Ray version increased from 2.50 to 2.54. Users on Ray 2.50–2.53 must upgrade before installing this release.
- TextSemanticDeduplicationWorkflow embedding backend: The default embedding backend changed from SentenceTransformers to vLLM. The default model changed from sentence-transformers/all-MiniLM-L6-v2 to google/embeddinggemma-300m. The parameters embedding_model_inference_batch_size, embedding_pooling, embedding_padding_side, and embedding_max_seq_length have been removed. Use embedding_vllm_init_kwargs to pass configuration to the vLLM backend instead.
- Resources API: The nvdecs, nvencs, and entire_gpu fields have been removed from Resources. Stages that previously used entire_gpu=True should use gpus=1 instead. Stages that used nvdecs or nvencs should use gpus for GPU allocation.
- AudioBatch removed: Replaced by AudioTask. Update imports from "from nemo_curator.tasks import AudioBatch" to "from nemo_curator.tasks import AudioTask". Data is now a single dict instead of dict | list[dict].
- RayDataExecutor import path: Moved from nemo_curator.backends.experimental.ray_data to nemo_curator.backends.ray_data. Update imports accordingly.
- DocumentExtractStage removed: The standalone DocumentExtractStage class has been removed. Use DocumentIterateExtractStage with an optional extractor parameter instead. The DocumentExtractor abstract base class is unchanged.
- DocumentIterateStage renamed: DocumentIterateStage has been replaced by DocumentIterateExtractStage. Update imports from nemo_curator.stages.text.download.base.iterator.
- Three-stage pipeline: The data acquisition pipeline is now a three-step pattern (URL generation → download → iterate-extract) instead of four steps.
- ExactDeduplicationWorkflow.run() and FuzzyDeduplicationWorkflow.run() now return WorkflowRunResult instead of None.
- SemanticDeduplicationWorkflow.run() and TextSemanticDeduplicationWorkflow.run() now return WorkflowRunResult instead of dict.
- TextDuplicatesRemovalWorkflow.run() now returns WorkflowRunResult instead of list[FileGroupTask] | None.
- Code that previously ignored the return value is unaffected. Code that consumed the old dict return from semantic workflows must migrate to result.metadata and result.pipeline_tasks.