NeMo Curator Migration Guide: Dask to Ray
This guide explains how to transition existing Dask-based NeMo Curator workflows to the new Ray-based pipeline architecture.
For broader NeMo Framework migration topics, refer to the NeMo Framework 2.0 Migration Guide.
Overview
NeMo Curator previously used Dask as its primary execution engine for distributed data processing. The latest Curator architecture transitions to Ray as a unified backend, enabling all modalities—text, image, video, and audio—to use a single, consistent execution engine.
Workflows built as sequential function calls will need to be refactored into pipelines composed of modular stages. This migration guide explains how to transition existing workflows to the new modular, Ray-based Curator Pipeline structure.
Previous Approach: Dask-Based Sequential Processing
The example below is a skeleton Dask-based data loading workflow. Each operation is represented as a function and applied to the entire dataset at once.
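A minimal sketch of that style is shown below; the inline list of dicts and the function bodies are stand-ins for the real Dask-backed dataset and helpers.

```python
# Hypothetical sketch of the sequential style: each step is a plain
# function applied to the whole dataset before the next step runs.
def load_dataset(path):
    # Stand-in for reading JSONL into a DocumentDataset
    return [{"text": ' "Hello world" '}, {"text": "hi"}]

def clean(docs):
    # Strip quotation marks and surrounding whitespace
    return [{**d, "text": d["text"].replace('"', "").strip()} for d in docs]

def filter_by_word_count(docs, min_words=2):
    # Keep only documents with at least min_words words
    return [d for d in docs if len(d["text"].split()) >= min_words]

docs = filter_by_word_count(clean(load_dataset("data.jsonl")))
# docs -> [{"text": "Hello world"}]
```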
New Approach: Ray-Based Modular Pipelines
The example below implements the same skeleton workflow using the new Curator Pipeline architecture. Each stage is a standalone component focused on a specific operation and can be flexibly combined within a Pipeline object. In this new system, data flows through the pipeline as discrete tasks, each containing a batch of data (such as a DocumentBatch for text or ImageBatch for images). Each stage operates independently and in parallel on its assigned batch.
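The shape of the new approach can be sketched with simplified stand-in classes; these are not the real Curator APIs, only an illustration of the stage-and-pipeline pattern.

```python
# Simplified stand-ins for the Curator stage/pipeline pattern: each stage
# transforms one batch (task), and the pipeline chains stages in order.
class CleanStage:
    def process(self, batch):
        return [{**d, "text": d["text"].replace('"', "").strip()} for d in batch]

class WordCountFilterStage:
    def __init__(self, min_words=2):
        self.min_words = min_words

    def process(self, batch):
        return [d for d in batch if len(d["text"].split()) >= self.min_words]

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def run(self, batch):
        # Each stage receives the previous stage's output batch
        for stage in self.stages:
            batch = stage.process(batch)
        return batch

pipeline = Pipeline([CleanStage(), WordCountFilterStage(min_words=2)])
result = pipeline.run([{"text": ' "Hello world" '}, {"text": "hi"}])
# result -> [{"text": "Hello world"}]
```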
For more details about the new design, refer to the Curator Ray API Design documentation.
Migrating Text Curation Workflows
Previously, NeMo Curator loaded and processed text data as standardized DocumentDataset objects. These objects could then be used for further curation, including additional processing, filtering, and generation steps.
In the new release, this same functionality is available through a pipeline architecture, which uses stages to handle each discrete curation task.
The following example data loading pipeline showcases the differences between the Dask-based (previous) and Ray-based (current) curation approaches.
Step 1: Start a Distributed Computing Client
The script begins by initializing the distributed computing client. This client manages execution of tasks across multiple workers.
Previous: Dask Cluster
Initialize a local Dask cluster, specifying cluster_type="gpu" to leverage GPU resources.
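In code, this typically looked like the following sketch; the helper name matches previous releases, but verify the import path against the version you are migrating from.

```python
def start_dask_client():
    # Helper shipped in previous Curator releases; cluster_type="gpu"
    # launched a GPU-backed local cluster.
    from nemo_curator.utils.distributed_utils import get_client

    return get_client(cluster_type="gpu")
```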
New: Ray Cluster
Connect to a Ray cluster, which can manage tasks across CPU or GPU-backed nodes. For cluster setup details, refer to Production Deployment Requirements and the Ray documentation.
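A minimal sketch, assuming a default local Ray setup:

```python
def start_ray():
    # ray.init() with no arguments starts a local cluster; pass
    # address="auto" to attach to an already-running cluster instead.
    import ray

    ray.init()
    return ray
```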
Step 2: Define Operations
In this step, core data curation operations—such as loading, cleaning, filtering, and deduplication—are defined. In Dask-based workflows, each processing step is written as a sequential call (often as Python functions or chained operations). In Ray-based workflows, each operation is expressed as a modular, declarative stage.
Example operations:
- Download the dataset and convert it to JSONL format
- Clean and unify the dataset (remove quotation marks, normalize Unicode)
- Filter the dataset based on various criteria (word count, completeness)
- Remove exact duplicates from the dataset (deduplication)
Previous: Sequential Operations
In the previous version of NeMo Curator, the data loading and formatting process could be run sequentially, as individual functions or within main(), as in the code snippet below.
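A schematic example of that style, with stand-in data and simplified implementations of each step:

```python
# Hypothetical sequential main(): download, clean, filter, then
# deduplicate, each applied to the full dataset in turn.
def deduplicate(docs):
    # Exact dedup on the text field: keep the first occurrence only
    seen, unique = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            unique.append(d)
    return unique

def main():
    docs = [{"text": 'a "b"'}, {"text": "a b"}, {"text": "x"}]        # stand-in for download
    docs = [{**d, "text": d["text"].replace('"', "")} for d in docs]  # clean
    docs = [d for d in docs if len(d["text"].split()) >= 2]           # filter by word count
    return deduplicate(docs)                                          # exact dedup

result = main()
# result -> [{"text": "a b"}]
```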
New: Modular Stages
In the new version, these operations are defined as discrete stages that operate on batches of data. Each stage can specify resources such as GPU count or CPU threads. For details on available filters, content cleaning operations, and pipeline concepts, refer to the linked documentation.
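The per-stage resource idea can be illustrated with a simplified stand-in; the real Curator stages expose a similar, but not identical, resources specification.

```python
# Stand-in stage illustrating per-stage resource hints; the cpus/gpus
# attributes here mimic the resource specification real stages declare.
class FilterStage:
    def __init__(self, min_words, cpus=1.0, gpus=0.0):
        self.min_words = min_words
        self.cpus, self.gpus = cpus, gpus  # resource hints for the executor

    def process(self, batch):
        return [d for d in batch if len(d["text"].split()) >= self.min_words]

stage = FilterStage(min_words=2, cpus=2.0)
out = stage.process([{"text": "one two"}, {"text": "one"}])
# out -> [{"text": "one two"}]
```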
In the new version, deduplication should be run as a separate workflow using classes like ExactDeduplicationWorkflow, not embedded directly as a pipeline stage. For details and usage, refer to the text deduplication documentation.
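A sketch of running the workflow separately; the import path and constructor parameters are assumptions to check against your installed version.

```python
def run_exact_dedup(input_path: str, output_path: str):
    # ExactDeduplicationWorkflow runs on its own, outside the main
    # pipeline; the import path and parameter names here are assumptions.
    from nemo_curator.stages.deduplication.exact import ExactDeduplicationWorkflow

    ExactDeduplicationWorkflow(input_path=input_path, output_path=output_path).run()
```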
Step 3: Create and Run Pipeline
After defining all the required processing steps, you can assemble and execute your workflow.
In the new version, a Pipeline object is created from the previously defined stages and executed by calling its run() method, which runs the pipeline on the Xenna executor.
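A sketch of the assembly step; the Pipeline and XennaExecutor import paths follow current Curator conventions but should be verified against your installed version.

```python
def run_text_pipeline(stages):
    # Assemble the previously defined stages and run them on the
    # Xenna executor; import paths are illustrative.
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.backends.xenna import XennaExecutor

    pipeline = Pipeline(name="text_curation", stages=stages)
    return pipeline.run(XennaExecutor())
```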
Step 4: Stop the Client
As a final step, stop the distributed computing client to release resources and cleanly terminate your session.
Previous: Dask Client
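For Dask, closing the client was typically enough:

```python
def stop_dask_client(client):
    # Closing the client shuts down the workers it manages and frees
    # scheduler resources.
    client.close()
```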
New: Ray Client
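For Ray, a minimal sketch:

```python
def stop_ray():
    # ray.shutdown() disconnects this process from the cluster and, for a
    # locally started cluster, tears it down.
    import ray

    ray.shutdown()
```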
This is a high-level example, and exact implementation details may vary. For more in-depth information about setting up text curation pipelines, refer to the text curation quickstart.
Migrating Image Curation Workflows
This section demonstrates how to transition Dask-based image curation to the new Ray-based modular pipeline.
The following steps walk through constructing and running an image curation workflow in the new release, highlighting differences and adjustments compared to the old workflow.
Step 1: Start a Distributed Computing Client
First, start your distributed computing client.
Previous: Dask Client
The previous version relied on a Dask client, specifying cluster_type="gpu" to leverage GPU resources.
New: Ray Client
The new version uses Ray, which can be initialized with the following code:
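A minimal sketch, assuming a local cluster:

```python
def init_ray_for_images():
    # ray.init() with no arguments starts a local cluster; on a multi-node
    # deployment, pass address="auto" to attach to the running cluster.
    import ray

    ray.init()
```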
Step 2: Load and Preprocess Data
Next, load your image data. This step reads image files and prepares them for downstream processing.
Previous: Dataset-Based Loading
In the previous version of NeMo Curator, data loading was performed using helper functions from Curator dataset classes such as ImageTextPairDataset. This approach required users to directly manage dataset construction and often involved chaining Dask-based operations for filtering or transformation.
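A sketch of that pattern; the from_webdataset signature reflects previous releases and should be verified against the version you are migrating from.

```python
def load_image_text_pairs(path: str):
    # Previous-release pattern: build the dataset directly from
    # WebDataset .tar shards (id_col names the per-sample key field).
    from nemo_curator.datasets import ImageTextPairDataset

    return ImageTextPairDataset.from_webdataset(path=path, id_col="key")
```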
New: Stage-Based Loading
In the new version, data loading is encapsulated in a dedicated pipeline stage (see Image Processing Concepts for details). Instead of directly creating a dataset, users define an ImageReaderStage that handles reading from WebDataset .tar files.
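A sketch, with the import path and parameters treated as assumptions to verify against your installed version:

```python
def make_image_reader(tar_dir: str):
    # ImageReaderStage reads WebDataset .tar shards and emits ImageBatch
    # tasks downstream; import path and parameter names are assumptions.
    from nemo_curator.stages.image.io.image_reader import ImageReaderStage

    return ImageReaderStage(input_dir=tar_dir, task_batch_size=100)
```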
Step 3: Generate CLIP Embeddings
Once the image-text data has been loaded, the next step is to convert it into vector representations using a CLIP (Contrastive Language-Image Pre-training) model. This allows the data to be used in tasks such as filtering, clustering, deduplication, and similarity search.
Previous: Direct Model Application
In the previous NeMo Curator version, embeddings were generated by instantiating an embedding model and applying it directly to the dataset object.
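A sketch of that pattern; the model name and batch size are illustrative.

```python
def embed_dataset(dataset):
    # Previous-release pattern: instantiate a Timm-backed CLIP embedder
    # and apply it directly to the dataset object.
    from nemo_curator.image.embedders import TimmImageEmbedder

    embedder = TimmImageEmbedder(
        "vit_large_patch14_clip_quickgelu_224.openai",  # illustrative model
        batch_size=1024,
    )
    return embedder(dataset)
```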
New: Embedding Stage
In the new version, embedding generation is handled by a dedicated ImageEmbeddingStage pipeline stage with configurable resource parameters (see CLIP Embedding Stage for details).
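A sketch, with the import path and parameter names treated as assumptions:

```python
def make_embedding_stage():
    # The stage batches images onto the GPU for CLIP embedding; import
    # path and parameters are assumptions to check against your version.
    from nemo_curator.stages.image.embedders import ImageEmbeddingStage

    return ImageEmbeddingStage(model_dir="models/", num_gpus_per_worker=0.25)
```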
Step 4: Aesthetic Scoring
Aesthetic scoring assigns a quality score to each image based on its visual appeal. This score can be used to filter out poor-quality images from a dataset.
Previous: Classifier-Based Filtering
In the previous version, aesthetic scoring was performed by applying an AestheticClassifier directly to the dataset. This added a new column with scores and a boolean filter for high-quality images. The filtered dataset could then be saved using to_webdataset().
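A sketch of that pattern; the filter column name is illustrative.

```python
def score_and_save(dataset, output_dir: str):
    # Previous-release pattern: the classifier call adds a score column,
    # and to_webdataset() writes out only rows passing the filter column.
    from nemo_curator.image.classifiers import AestheticClassifier

    scored = AestheticClassifier()(dataset)
    scored.to_webdataset(output_dir, filter_column="passes_aesthetic_check")
```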
New: Aesthetic Filter Stage
In the new version, aesthetic scoring and filtering are handled by the ImageAestheticFilterStage (see Aesthetic Filter for details). This stage scores each image using a pretrained model and filters out images below a configured threshold.
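A sketch, with the import path and threshold parameter treated as assumptions:

```python
def make_aesthetic_filter(threshold: float = 0.5):
    # Images scoring below the threshold are dropped by the stage; import
    # path and parameter names are assumptions to verify.
    from nemo_curator.stages.image.filters import ImageAestheticFilterStage

    return ImageAestheticFilterStage(model_dir="models/", score_threshold=threshold)
```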
Step 5: Semantic Deduplication
Semantic deduplication removes visually or semantically similar images from the dataset by clustering embeddings and eliminating near-duplicates based on similarity.
Previous: Multi-Step Clustering and Deduplication
The previous NeMo Curator version required two separate steps. First, image embeddings were clustered to group similar images. Second, deduplication based on cosine similarity was performed within each cluster.
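The within-cluster half of that flow can be sketched in plain Python; this is a schematic stand-in, not Curator's implementation.

```python
# Schematic within-cluster dedup: keep an embedding only if it is not
# too cosine-similar to any embedding already kept.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def dedup_within_cluster(embs, threshold=0.99):
    kept = []
    for e in embs:
        if all(cosine(e, k) < threshold for k in kept):
            kept.append(e)
    return kept

kept = dedup_within_cluster([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
# The near-duplicate second vector is dropped; the first and third survive.
```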
New: Single Deduplication Stage
In the new version, semantic deduplication is encapsulated in a single stage, SemanticDeduplicationStage (see Deduplication Concepts for comprehensive documentation). This stage handles clustering and duplicate removal internally, using the configured number of clusters and similarity threshold.
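A sketch, with the import path and parameters treated as assumptions:

```python
def make_semantic_dedup_stage():
    # The stage clusters embeddings internally and removes pairs above the
    # similarity threshold; import path and parameters are assumptions.
    from nemo_curator.stages.image.deduplication import SemanticDeduplicationStage

    return SemanticDeduplicationStage(n_clusters=100, similarity_threshold=0.95)
```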
Step 6: Create and Run Pipeline
In the previous version, each command could be run directly; assembling the defined functions into a “pipeline” format was optional.
In the new version, once all the required stages are defined (data reading, embedding generation, aesthetic filtering, and deduplication), you can assemble them into a pipeline and run it using a Ray-based executor.
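A sketch of the assembly step; the Pipeline and XennaExecutor import paths should be verified against your installed version.

```python
def run_image_pipeline(stages):
    # stages should be the reader, embedding, aesthetic-filter, and
    # dedup stages in order; import paths are illustrative.
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.backends.xenna import XennaExecutor

    return Pipeline(name="image_curation", stages=stages).run(XennaExecutor())
```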
Step 7: Stop the Client
As a final step, stop the distributed computing client to release resources and cleanly terminate your session.
Previous: Dask Client
New: Ray Client
This is a high-level example, and exact implementation details may vary. For more in-depth information about setting up image curation pipelines, refer to the image curation quickstart.
Additional Resources
For questions about migration or other topics, refer to the Migration FAQ.