NeMo Curator Migration Guide: Dask to Ray
This guide explains how to transition existing Dask-based NeMo Curator workflows to the new Ray-based pipeline architecture.
For broader NeMo Framework migration topics, refer to the NeMo Framework 2.0 Migration Guide.
Overview
NeMo Curator previously used Dask as its primary execution engine for distributed data processing. The latest Curator architecture transitions to Ray as a unified backend, enabling all modalities—text, image, video, and audio—to use a single, consistent execution engine.
Workflows built as sequential function calls will need to be refactored into pipelines composed of modular stages. This migration guide explains how to transition existing workflows to the new modular, Ray-based Curator Pipeline structure.
Previous Approach: Dask-Based Sequential Processing
The example below is a skeleton Dask-based data loading workflow. Each operation is represented as a function and applied to the entire dataset at once.
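A minimal sketch of that style is shown below; the inline list of dicts and the function bodies are stand-ins for the real Dask-backed dataset and helpers.

```python
# Hypothetical sketch of the sequential style: each step is a plain
# function applied to the whole dataset before the next step runs.
def load_dataset(path):
    # Stand-in for reading JSONL into a DocumentDataset
    return [{"text": ' "Hello world" '}, {"text": "hi"}]

def clean(docs):
    # Strip quotation marks and surrounding whitespace
    return [{**d, "text": d["text"].replace('"', "").strip()} for d in docs]

def filter_by_word_count(docs, min_words=2):
    # Keep only documents with at least min_words words
    return [d for d in docs if len(d["text"].split()) >= min_words]

docs = filter_by_word_count(clean(load_dataset("data.jsonl")))
# docs -> [{"text": "Hello world"}]
```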
New Approach: Ray-Based Modular Pipelines
The example below implements the same skeleton workflow using the new Curator Pipeline architecture. Each stage is a standalone component focused on a specific operation and can be flexibly combined within a Pipeline object. In this new system, data flows through the pipeline as discrete tasks, each containing a batch of data (such as a DocumentBatch for text or ImageBatch for images). Each stage operates independently and in parallel on its assigned batch.
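The shape of the new approach can be sketched with simplified stand-in classes; these are not the real Curator APIs, only an illustration of the stage-and-pipeline pattern.

```python
# Simplified stand-ins for the Curator stage/pipeline pattern: each stage
# transforms one batch (task), and the pipeline chains stages in order.
class CleanStage:
    def process(self, batch):
        return [{**d, "text": d["text"].replace('"', "").strip()} for d in batch]

class WordCountFilterStage:
    def __init__(self, min_words=2):
        self.min_words = min_words

    def process(self, batch):
        return [d for d in batch if len(d["text"].split()) >= self.min_words]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def run(self, batch):
        # Each stage receives the previous stage's output batch
        for stage in self.stages:
            batch = stage.process(batch)
        return batch

pipeline = Pipeline([CleanStage(), WordCountFilterStage(min_words=2)])
result = pipeline.run([{"text": ' "Hello world" '}, {"text": "hi"}])
# result -> [{"text": "Hello world"}]
```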
For more details about the new design, refer to the Curator Ray API Design documentation.
Migrating Text Curation Workflows
Previously, NeMo Curator loaded and processed text data as standardized DocumentDataset objects. These objects could then be used for further curation, including additional processing, filtering, and generation steps.
In the new release, this same functionality is available through a pipeline architecture, which uses stages to handle each discrete curation task.
The following example data loading pipeline showcases the differences between the Dask-based (previous) and Ray-based (current) curation approaches.
Step 1: Start a Distributed Computing Client
The script begins by initializing the distributed computing client. This client manages execution of tasks across multiple workers.
Previous: Dask Cluster
Initialize a local Dask cluster, specifying cluster_type="gpu" to leverage GPU resources.
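In code, this typically looked like the following sketch; the helper name matches previous releases, but verify the import path against the version you are migrating from.

```python
def start_dask_client():
    # Helper shipped in previous Curator releases; cluster_type="gpu"
    # launched a GPU-backed local cluster.
    from nemo_curator.utils.distributed_utils import get_client

    return get_client(cluster_type="gpu")
```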
New: Ray Cluster
Connect to a Ray cluster, which can manage tasks across CPU or GPU-backed nodes. For cluster setup details, refer to Production Deployment Requirements and the Ray documentation.
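A minimal sketch, assuming a default local Ray setup:

```python
def start_ray():
    # ray.init() with no arguments starts a local cluster; pass
    # address="auto" to attach to an already-running cluster instead.
    import ray

    ray.init()
    return ray
```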
Step 2: Define Operations
In this step, core data curation operations—such as loading, cleaning, filtering, and deduplication—are defined. In Dask-based workflows, each processing step is written as a sequential call (often as Python functions or chained operations). In Ray-based workflows, each operation is expressed as a modular, declarative stage.
Example operations:
- Download the dataset and convert it to JSONL format
- Clean and unify the dataset (remove quotation marks, normalize Unicode)
- Filter the dataset based on various criteria (word count, completeness)
- Remove exact duplicates from the dataset (deduplication)
Previous: Sequential Operations
In the previous version of NeMo Curator, the data loading and formatting process could be run sequentially, as individual functions or within main(), as in the code snippet below.
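A schematic example of that style, with stand-in data and simplified implementations of each step:

```python
# Hypothetical sequential main(): download, clean, filter, then
# deduplicate, each applied to the full dataset in turn.
def deduplicate(docs):
    # Exact dedup on the text field: keep the first occurrence only
    seen, unique = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            unique.append(d)
    return unique

def main():
    docs = [{"text": 'a "b"'}, {"text": "a b"}, {"text": "x"}]        # stand-in for download
    docs = [{**d, "text": d["text"].replace('"', "")} for d in docs]  # clean
    docs = [d for d in docs if len(d["text"].split()) >= 2]           # filter by word count
    return deduplicate(docs)                                          # exact dedup

result = main()
# result -> [{"text": "a b"}]
```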
New: Modular Stages
In the new version, these operations are defined as discrete stages that operate on batches of data. Each stage can specify resources such as GPU count or CPU threads. For details on available filters, content cleaning operations, and pipeline concepts, refer to the linked documentation.
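The per-stage resource idea can be illustrated with a simplified stand-in; the real Curator stages expose a similar, but not identical, resources specification.

```python
# Stand-in stage illustrating per-stage resource hints; the cpus/gpus
# attributes here mimic the resource specification real stages declare.
class FilterStage:
    def __init__(self, min_words, cpus=1.0, gpus=0.0):
        self.min_words = min_words
        self.cpus, self.gpus = cpus, gpus  # resource hints for the executor

    def process(self, batch):
        return [d for d in batch if len(d["text"].split()) >= self.min_words]

stage = FilterStage(min_words=2, cpus=2.0)
out = stage.process([{"text": "one two"}, {"text": "one"}])
# out -> [{"text": "one two"}]
```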
In the new version, deduplication should be run as a separate workflow using classes like ExactDeduplicationWorkflow, not embedded directly as a pipeline stage. For details and usage, refer to the text deduplication documentation.
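A sketch of running the workflow separately; the import path and constructor parameters are assumptions to check against your installed version.

```python
def run_exact_dedup(input_path: str, output_path: str):
    # ExactDeduplicationWorkflow runs on its own, outside the main
    # pipeline; the import path and parameter names here are assumptions.
    from nemo_curator.stages.deduplication.exact import ExactDeduplicationWorkflow

    ExactDeduplicationWorkflow(input_path=input_path, output_path=output_path).run()
```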
Step 3: Create and Run Pipeline
After defining all the required processing steps, you can assemble and execute your workflow.
In the new version, a Pipeline object is created from the previously defined stages and executed by calling its run() method, which runs the pipeline on the Xenna executor.
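A sketch of the assembly step; the Pipeline and XennaExecutor import paths follow current Curator conventions but should be verified against your installed version.

```python
def run_text_pipeline(stages):
    # Assemble the previously defined stages and run them on the
    # Xenna executor; import paths are illustrative.
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.backends.xenna import XennaExecutor

    pipeline = Pipeline(name="text_curation", stages=stages)
    return pipeline.run(XennaExecutor())
```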
Step 4: Stop the Client
As a final step, stop the distributed computing client to release resources and cleanly terminate your session.
Previous: Dask Client
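For Dask, closing the client was typically enough:

```python
def stop_dask_client(client):
    # Closing the client shuts down the workers it manages and frees
    # scheduler resources.
    client.close()
```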
New: Ray Client
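For Ray, a minimal sketch:

```python
def stop_ray():
    # ray.shutdown() disconnects this process from the cluster and, for a
    # locally started cluster, tears it down.
    import ray

    ray.shutdown()
```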
This is a high-level example, and exact implementation details may vary. For more in-depth information about setting up text curation pipelines, refer to the text curation quickstart.
Migrating Image Curation Workflows
This section demonstrates how to transition Dask-based image curation to the new Ray-based modular pipeline.
The following steps walk through constructing and running an image curation workflow in the new release, highlighting differences and adjustments compared to the old workflow.
Step 1: Start a Distributed Computing Client
First, start your distributed computing client.
Previous: Dask Client
The previous version relied on a Dask client, specifying cluster_type="gpu" to leverage GPU resources.
New: Ray Client
The new version uses Ray, which can be initialized with the following code:
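A minimal sketch, assuming a local cluster:

```python
def init_ray_for_images():
    # ray.init() with no arguments starts a local cluster; on a multi-node
    # deployment, pass address="auto" to attach to the running cluster.
    import ray

    ray.init()
```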
Step 2: Load and Preprocess Data
Next, load your image data. This step reads image files and prepares them for downstream processing.
Previous: Dataset-Based Loading
In the previous version of NeMo Curator, data loading was performed using helper functions from Curator dataset classes such as ImageTextPairDataset. This approach required users to directly manage dataset construction and often involved chaining Dask-based operations for filtering or transformation.
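A sketch of that pattern; the from_webdataset signature reflects previous releases and should be verified against the version you are migrating from.

```python
def load_image_text_pairs(path: str):
    # Previous-release pattern: build the dataset directly from
    # WebDataset .tar shards (id_col names the per-sample key field).
    from nemo_curator.datasets import ImageTextPairDataset

    return ImageTextPairDataset.from_webdataset(path=path, id_col="key")
```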
New: Stage-Based Loading
In the new version, data loading is encapsulated in a dedicated pipeline stage (see Image Processing Concepts for details). Instead of directly creating a dataset, users define an ImageReaderStage that handles reading from WebDataset .tar files.
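A sketch, with the import path and parameters treated as assumptions to verify against your installed version:

```python
def make_image_reader(tar_dir: str):
    # ImageReaderStage reads WebDataset .tar shards and emits ImageBatch
    # tasks downstream; import path and parameter names are assumptions.
    from nemo_curator.stages.image.io.image_reader import ImageReaderStage

    return ImageReaderStage(input_dir=tar_dir, task_batch_size=100)
```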
Step 3: Generate CLIP Embeddings
Once the image-text data has been loaded, the next step is to convert it into vector representations using a CLIP (Contrastive Language-Image Pre-training) model. This allows the data to be used in tasks such as filtering, clustering, deduplication, and similarity search.
Previous: Direct Model Application
In the previous NeMo Curator version, embeddings were generated by instantiating an embedding model and applying it directly to the dataset object.
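A sketch of that pattern; the model name and batch size are illustrative.

```python
def embed_dataset(dataset):
    # Previous-release pattern: instantiate a Timm-backed CLIP embedder
    # and apply it directly to the dataset object.
    from nemo_curator.image.embedders import TimmImageEmbedder

    embedder = TimmImageEmbedder(
        "vit_large_patch14_clip_quickgelu_224.openai",  # illustrative model
        batch_size=1024,
    )
    return embedder(dataset)
```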
New: Embedding Stage
In the new version, embedding generation is handled by a dedicated ImageEmbeddingStage pipeline stage with configurable resource parameters (see CLIP Embedding Stage for details).
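A sketch, with the import path and parameter names treated as assumptions:

```python
def make_embedding_stage():
    # The stage batches images onto the GPU for CLIP embedding; import
    # path and parameters are assumptions to check against your version.
    from nemo_curator.stages.image.embedders import ImageEmbeddingStage

    return ImageEmbeddingStage(model_dir="models/", num_gpus_per_worker=0.25)
```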
Step 4: Aesthetic Scoring
Aesthetic scoring assigns a quality score to each image based on its visual appeal. This score can be used to filter out poor-quality images from a dataset.
Previous: Classifier-Based Filtering
In the previous version, aesthetic scoring was performed by applying an AestheticClassifier directly to the dataset. This added a new column with scores and a boolean filter for high-quality images. The filtered dataset could then be saved using to_webdataset().
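A sketch of that pattern; the filter column name is illustrative.

```python
def score_and_save(dataset, output_dir: str):
    # Previous-release pattern: the classifier call adds a score column,
    # and to_webdataset() writes out only rows passing the filter column.
    from nemo_curator.image.classifiers import AestheticClassifier

    scored = AestheticClassifier()(dataset)
    scored.to_webdataset(output_dir, filter_column="passes_aesthetic_check")
```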
New: Aesthetic Filter Stage
In the new version, aesthetic scoring and filtering are handled by the ImageAestheticFilterStage (see Aesthetic Filter for details). This stage scores each image using a pretrained model and filters out images below a configured threshold.
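A sketch, with the import path and threshold parameter treated as assumptions:

```python
def make_aesthetic_filter(threshold: float = 0.5):
    # Images scoring below the threshold are dropped by the stage; import
    # path and parameter names are assumptions to verify.
    from nemo_curator.stages.image.filters import ImageAestheticFilterStage

    return ImageAestheticFilterStage(model_dir="models/", score_threshold=threshold)
```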
Step 5: Semantic Deduplication
Semantic deduplication removes visually or semantically similar images from the dataset by clustering embeddings and eliminating near-duplicates based on similarity.
Previous: Multi-Step Clustering and Deduplication
The previous NeMo Curator version required two separate steps. First, image embeddings were clustered to group similar images. Second, deduplication based on cosine similarity was performed within each cluster.
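The within-cluster half of that flow can be sketched in plain Python; this is a schematic stand-in, not Curator's implementation.

```python
# Schematic within-cluster dedup: keep an embedding only if it is not
# too cosine-similar to any embedding already kept.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def dedup_within_cluster(embs, threshold=0.99):
    kept = []
    for e in embs:
        if all(cosine(e, k) < threshold for k in kept):
            kept.append(e)
    return kept

kept = dedup_within_cluster([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
# The near-duplicate second vector is dropped; the first and third survive.
```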
New: Single Deduplication Stage
In the new version, semantic deduplication is encapsulated in a single stage, SemanticDeduplicationStage (see Deduplication Concepts for comprehensive documentation). This stage handles clustering and duplicate removal internally, using the configured number of clusters and similarity threshold.
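A sketch, with the import path and parameters treated as assumptions:

```python
def make_semantic_dedup_stage():
    # The stage clusters embeddings internally and removes pairs above the
    # similarity threshold; import path and parameters are assumptions.
    from nemo_curator.stages.image.deduplication import SemanticDeduplicationStage

    return SemanticDeduplicationStage(n_clusters=100, similarity_threshold=0.95)
```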
Step 6: Create and Run Pipeline
In the previous version, each command could be run directly; assembling the defined functions into a “pipeline” format was optional.
In the new version, once all the required stages are defined (data reading, embedding generation, aesthetic filtering, and deduplication), you can assemble them into a pipeline and run it using a Ray-based executor.
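A sketch of the assembly step; the Pipeline and XennaExecutor import paths should be verified against your installed version.

```python
def run_image_pipeline(stages):
    # stages should be the reader, embedding, aesthetic-filter, and
    # dedup stages in order; import paths are illustrative.
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.backends.xenna import XennaExecutor

    return Pipeline(name="image_curation", stages=stages).run(XennaExecutor())
```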
Step 7: Stop the Client
As a final step, stop the distributed computing client to release resources and cleanly terminate your session.
Previous: Dask Client
New: Ray Client
This is a high-level example, and exact implementation details may vary. For more in-depth information about setting up image curation pipelines, refer to the image curation quickstart.
Additional Resources
For questions about migration or other topics, refer to the Migration FAQ.