
NeMo Curator Migration FAQ


Frequently asked questions about migrating from Dask to Ray-based NeMo Curator.

For step-by-step migration instructions, refer to the Migration Guide.


General Architecture and Migration

The migration aims to deliver significantly better scalability and performance, especially for large, heterogeneous, and GPU-intensive workflows. Ray simplifies cluster resource management and enables more dynamic scheduling and throughput optimization than Dask. The underlying orchestration and task management have been completely rewritten for efficiency.

Internal benchmarks show that Ray matches or outperforms the previous Dask-based implementation, especially on multi-GPU clusters and for streaming, model-heavy workloads. We are finalizing transparent benchmark numbers and reproducible environments so users can run their own comparisons and verify performance on their own clusters.

Some migration effort is expected. High-level abstractions—like tasks and stages—still map well, but the new system follows a strictly linear, stage-wise architecture (meaning each stage feeds directly into the next, without branches or fan-outs). If your pipeline structure or branching logic differs, adaptation will be needed. Additionally, Ray cluster deployment may require some effort if you have pre-existing automation or scripts for Dask or Spark.

Ray pipelines in NeMo Curator are strictly linear. If you need to apply multiple independent filters or models in parallel, combine them within a single stage (you can parallelize inside your code using subprocesses if needed). Direct fan-out or branching between stages is not supported; this constraint keeps scheduling simple and throughput high.


Pipeline Design and Operation

Ray pipelines use a streaming architecture in which data tasks flow from stage to stage in memory wherever possible (through Ray's object store), minimizing costly file-system I/O. Production runs process data continuously for maximum throughput. For more details, refer to Key Abstractions.

Each stage can write outputs or checkpoints to disk as needed (for recovery or later analysis). Pipeline stages do not wait for these writes to complete; instead, writes should be implemented in a non-blocking way inside your stage logic. For most use cases, this allows the main dataflow and processing to remain high-throughput and non-blocking. If you need to guarantee write success and handle failures, you can use subprocess-based logging or retry logic within your process functions.
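
As an illustration, one way to keep such writes off the critical path is to hand them to a background thread pool inside the stage. The following is a minimal sketch in plain Python, assuming a simple stage shape with a `process` method; it is not the NeMo Curator API:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


class CheckpointingStage:
    """Hypothetical stage that checkpoints each batch without blocking."""

    def __init__(self, checkpoint_dir: str, max_writers: int = 2):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        # Writes run on background threads, off the main dataflow.
        self._writers = ThreadPoolExecutor(max_workers=max_writers)

    def _write_checkpoint(self, task_id: str, records: list[dict]) -> None:
        # Add retry or alerting logic here if you must guarantee
        # write success.
        path = self.checkpoint_dir / f"{task_id}.jsonl"
        with open(path, "w") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")

    def process(self, task_id: str, records: list[dict]) -> list[dict]:
        transformed = [{**r, "processed": True} for r in records]
        # Schedule the write and return immediately; the next stage
        # never waits on disk I/O.
        self._writers.submit(self._write_checkpoint, task_id, transformed)
        return transformed
```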

Pipelines must be linear (no branching or fan-out between stages). If you need to apply multiple independent filters or models, group them together into a single stage and manage that in your own code (possibly with subprocesses).
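
For example, several independent document filters can be bundled into one composite stage. This is a minimal sketch with made-up filter functions, not the NeMo Curator filter API:

```python
from typing import Callable

Document = dict
FilterFn = Callable[[Document], bool]


def long_enough(doc: Document) -> bool:
    return len(doc.get("text", "")) >= 50


def mostly_alphabetic(doc: Document) -> bool:
    text = doc.get("text", "")
    if not text:
        return False
    return sum(ch.isalpha() for ch in text) / len(text) >= 0.6


class CompositeFilterStage:
    """Applies every filter in order; a document must pass all of them."""

    def __init__(self, filters: list[FilterFn]):
        self.filters = filters

    def process(self, batch: list[Document]) -> list[Document]:
        return [doc for doc in batch if all(f(doc) for f in self.filters)]


stage = CompositeFilterStage([long_enough, mostly_alphabetic])
```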

Task creation happens upfront: the task creation stage lists all work units (for example, files from all snapshots) before processing begins. Ray then orchestrates parallel processing, maximizing resource utilization across the whole dataset and streaming batches through as tasks complete.
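
As a concrete illustration, a task-creation step might enumerate one work unit per input file across all snapshots before any processing starts. This sketch is plain Python with hypothetical names, not the NeMo Curator API:

```python
from pathlib import Path


def create_tasks(snapshot_dirs: list[str]) -> list[dict]:
    """Enumerate every work unit upfront so the executor can schedule
    the whole dataset at once."""
    tasks = []
    for snapshot in snapshot_dirs:
        for path in sorted(Path(snapshot).glob("*.jsonl")):
            tasks.append({
                "task_id": f"{Path(snapshot).name}__{path.stem}",
                "input_path": str(path),
            })
    return tasks
```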

Filtered data does not have to be discarded: you can configure logic in your process functions to write out filtered or dropped examples as a "splitting" operation rather than a hard filter, enabling later analysis, retraining, or recycling of low-quality data.
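
A minimal sketch of such a splitting operation, with an assumed per-document `quality_score` field:

```python
def split_by_quality(batch: list[dict], threshold: float = 0.5):
    """Route documents into kept and dropped streams instead of
    silently discarding failures."""
    kept, dropped = [], []
    for doc in batch:
        target = kept if doc.get("quality_score", 0.0) >= threshold else dropped
        target.append(doc)
    # Dropped examples can be written out for later analysis, rewriting,
    # or recycling; kept examples continue down the pipeline.
    return kept, dropped
```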


Resource Management and Performance Optimization

An internal adaptive scheduler monitors throughput for each stage every few minutes (configurable) and dynamically adjusts worker counts and allocations to maximize total pipeline throughput and avoid bottlenecks.

Each stage specifies its own resource needs in code (see Pipeline Execution Backends for configuration details). Set required CPU or GPU count, GPU VRAM, and other specs directly. Ray packs tasks optimally (for example, several light jobs on one GPU).
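
The sketch below illustrates the pattern of declaring per-stage resources. The `Resources` data class and field names here are illustrative stand-ins; refer to Pipeline Execution Backends for the actual API:

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Illustrative stand-in for a per-stage resource specification."""
    cpus: float = 1.0
    gpus: float = 0.0
    gpu_memory_gb: float = 0.0


class LightClassifierStage:
    # A light model: request a fraction of a GPU so several such
    # workers can be packed onto one device.
    resources = Resources(cpus=2.0, gpus=0.25, gpu_memory_gb=8.0)

    def process(self, batch):
        return batch  # model inference would go here
```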

Ray will restart any failed actor (worker), but handling logic for OOM or other errors is up to your code. Use try/except blocks to decide whether to retry, skip, log, or stop.
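
For example, a process function might skip documents that exhaust memory but re-raise anything unexpected. This is a generic sketch; `run_model` is a placeholder for your own inference call, and GPU frameworks typically raise their own OOM exception type rather than Python's `MemoryError`:

```python
import logging

logger = logging.getLogger(__name__)


def process_with_handling(batch: list[dict], run_model) -> list[dict]:
    results = []
    for doc in batch:
        try:
            results.append(run_model(doc))
        except MemoryError:
            # OOM on a single document: skip it and keep the worker alive.
            logger.warning("Skipping oversized document %s", doc.get("id"))
        except Exception:
            # Unexpected failure: log and re-raise so the pipeline
            # surfaces it instead of silently dropping data.
            logger.exception("Unrecoverable error on %s", doc.get("id"))
            raise
    return results
```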

Ray natively orchestrates jobs across nodes and GPUs, so multi-node scaling works out of the box. For multi-node operations (for example, distributed model training or global deduplication), you set up communication (such as NCCL for PyTorch) within the relevant stage's setup logic.
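
A minimal sketch of NCCL initialization inside a stage's setup logic. The hook names and how rank, world size, and master address are supplied are assumptions; the `torch.distributed` calls themselves are standard PyTorch:

```python
import os

import torch.distributed as dist


class DistributedDedupStage:
    """Hypothetical multi-node stage that sets up NCCL in its setup hook."""

    def setup(self, rank: int, world_size: int, master_addr: str) -> None:
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = "29500"
        # One process per GPU; NCCL handles cross-node GPU communication.
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    def teardown(self) -> None:
        dist.destroy_process_group()
```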

Internally, an allocator ensures data locality for heavy data-transfer stages (for example, tasks passing large data between GPU-intensive stages tend to run on the same GPU or node to minimize transfer times). This is handled by Cosmos Xenna and Ray's underlying allocators and optimization routines rather than by NeMo Curator itself, and it colocates data and computation according to the resource constraints and specifications you declare.


Data Processing, Models, and Quality

For assessing and improving data quality in languages that lack a dedicated quality classifier, NeMo Curator suggests several strategies (see Heuristic Filtering and Quality Assessment for details):

  • Use available multilingual models (for example, Qwen, Mistral, or other models with broad language coverage)
  • Annotate high-quality English data with a classifier, translate that data into the target language, and then train a smaller in-language model
  • Upsample high-quality data or downsample and rewrite lower-quality data instead of discarding it (using an LLM rewrite for quality improvement)

A set of models assigns a quality score (for example, 0–20), bucketed into high, medium, or low groups. High-quality data might be upsampled or have synthetic variants generated; low-quality data is rewritten rather than simply dropped—especially valuable for low-resource contexts.
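
A sketch of the bucketing step, with illustrative thresholds:

```python
def bucket_quality(score: int) -> str:
    """Bucket a 0-20 quality score; thresholds are illustrative."""
    if score >= 15:
        return "high"    # candidate for upsampling or synthetic variants
    if score >= 8:
        return "medium"  # keep as-is
    return "low"         # candidate for LLM rewriting rather than dropping
```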

Each task is a data class. You can add whatever statistics you need (input or output counts, tokens dropped, and so on) within stage logic for detailed reporting or logging.
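
For instance, a task data class can carry a statistics mapping that each stage updates as it runs. The field and stage names below are illustrative:

```python
from dataclasses import dataclass, field


@dataclass
class DocumentTask:
    task_id: str
    documents: list[dict]
    stats: dict[str, int] = field(default_factory=dict)


def length_filter_stage(task: DocumentTask) -> DocumentTask:
    before = len(task.documents)
    task.documents = [d for d in task.documents if len(d.get("text", "")) >= 50]
    # Record whatever you need for end-of-run reporting.
    task.stats["length_filter_in"] = before
    task.stats["length_filter_out"] = len(task.documents)
    return task
```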


Fault Tolerance, Checkpointing, and Observability

Ray will relaunch failed workers or actors, but robust error handling and resumption (for example, from last completed task) should be implemented in your stage logic as needed.

All Ray pipelines expose resource and processing metrics via a built-in Grafana dashboard (with process time per task or actor, resource utilization, and so on). You can also summarize stats from task data at pipeline completion for custom reporting. For configuration options, refer to Pipeline Execution Backends and the Ray Dashboard documentation.

For streaming pipelines, each completed task can be tracked (for example, recorded alongside its final outputs), so resumption logic reduces to "start from task N+1". If you rely on explicit intermediate checkpoints, you can extend the process logic to save and reload state as needed.
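
One simple way to implement "start from task N+1" is a marker file per completed task; on restart, only unmarked tasks are scheduled. The file layout here is an assumption, not a built-in mechanism:

```python
from pathlib import Path


def mark_complete(task_id: str, marker_dir: str) -> None:
    Path(marker_dir).mkdir(parents=True, exist_ok=True)
    (Path(marker_dir) / f"{task_id}.done").touch()


def pending_tasks(all_task_ids: list[str], marker_dir: str) -> list[str]:
    # Tasks with a marker file have already completed and are skipped.
    done = {p.stem for p in Path(marker_dir).glob("*.done")}
    return [t for t in all_task_ids if t not in done]
```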


Extensibility, Customization, and Collaboration

Yes, by design. Add new stages or modify process functions to integrate custom logic, models, or data preprocessing and postprocessing (see Key Abstractions for examples). Extend or fork example pipelines to suit new use cases.

Contributions are strongly encouraged. Submit pull requests or join community discussions to help expand NeMo Curator’s capabilities for diverse regions and languages.


Deployment, Infrastructure, and Practicalities

Ray clusters can be deployed on any major cloud platform (AWS, GCP, Azure) using standard Ray tools (see Ray documentation). No custom infrastructure is needed. NVIDIA provides ready-to-use Docker images and up-to-date quickstart guides. For complete deployment details, refer to Production Deployment Requirements.

Custom container builds are not needed for most standard uses. Use the provided images or extend them as needed (for example, to add proprietary or additional filters). Check out the official Docker container releases on the NGC Catalog.

Ray manages its own orchestration natively. For more traditional batch-job orchestration, Airflow or Slurm can be used to launch clusters or submit jobs, but within a running Ray pipeline, orchestration is handled by Ray itself.


Advanced Topics

The current model supports scatter and map, but not reduce or fan-in across tasks (for global aggregations). For most data curation workflows, this is not a limitation; custom logic can be used where needed.
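
When a global aggregation is needed, one workaround is to have each task emit partial results and reduce them in driver code after the pipeline finishes. A minimal sketch:

```python
from collections import Counter


def reduce_task_stats(per_task_stats: list[dict[str, int]]) -> Counter:
    """Fold per-task partial results (for example, token counts) into a
    single global aggregate outside the pipeline."""
    total = Counter()
    for stats in per_task_stats:
        total.update(stats)
    return total


# Example: two tasks report token counts; the global total is 350.
assert reduce_task_stats([{"tokens": 100}, {"tokens": 250}])["tokens"] == 350
```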

For multi-node operations, you’re responsible for managing communication setup (for example, establishing NCCL channels for distributed training within the relevant stage’s setup logic).

NeMo Curator supports multiple data modalities, including text, image, audio, and video.


Support, Communication, and Community

All support and community engagement happens on GitHub. We encourage you to:

  • Open an issue for bugs, feature requests, or questions
  • Start a discussion to share ideas or ask for guidance
  • Submit a pull request if you build innovative filters or features

Regular community calls and check-ins are also offered to connect with the team and other users.


Additional Resources

If you find something missing or want to share a best practice or feature, please open an issue or submit a pull request on GitHub.