ProcessingStage
The ProcessingStage class is the base class for all data processing stages in NeMo Curator. Each stage defines a single step in a data curation pipeline.
Import
Class Definition
Abstract Methods
inputs()
Define stage input requirements.
outputs()
Define what the stage produces as output.
process()
Process a single task.
Optional Lifecycle Methods
setup_on_node()
Node-level initialization (e.g., download models).
setup()
Worker-level initialization (e.g., load models).
teardown()
Cleanup after processing.
process_batch()
Process a batch of tasks in a single call. Override it when vectorized processing is faster than handling tasks one at a time.
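The lifecycle methods above can be pictured with a small stand-alone sketch. It does not import NeMo Curator: `StageSketch` and `run_stage` are hypothetical stand-ins, the method signatures are simplified (the real methods may take arguments), and the call order shown is an assumption based on the descriptions above. In practice the executor decides when each hook runs.

```python
from typing import Any


class StageSketch:
    """Hypothetical stand-in for a stage; not the NeMo Curator class."""

    def __init__(self) -> None:
        self.calls: list[str] = []  # records the order in which hooks run

    def setup_on_node(self) -> None:
        # Node-level work, e.g. downloading model files once per node.
        self.calls.append("setup_on_node")

    def setup(self) -> None:
        # Worker-level work, e.g. loading a model into memory.
        self.calls.append("setup")

    def process(self, task: Any) -> Any:
        self.calls.append("process")
        return task

    def process_batch(self, tasks: list[Any]) -> list[Any]:
        # Fallback: loop over process(). A subclass can override this
        # with a vectorized implementation.
        return [self.process(task) for task in tasks]

    def teardown(self) -> None:
        self.calls.append("teardown")


def run_stage(stage: StageSketch, tasks: list[Any]) -> list[Any]:
    """Hypothetical single-process driver standing in for an executor."""
    stage.setup_on_node()
    stage.setup()
    try:
        return stage.process_batch(tasks)
    finally:
        stage.teardown()  # cleanup runs even if processing fails
```

Overriding only `process_batch()` in a subclass leaves the rest of the lifecycle unchanged, which is the main reason to keep batch handling in its own method.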
Creating Custom Stages
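A custom stage implements the three abstract methods described above. The following is a minimal, self-contained sketch of that pattern rather than the library's own code: `StageBase`, `TextTask`, and `WordCountStage` are hypothetical names, and returning a plain list of field names from `inputs()` and `outputs()` is an assumption made for illustration. The real `ProcessingStage` return types may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class TextTask:
    """Hypothetical task type: a dict of named fields."""
    data: dict = field(default_factory=dict)


class StageBase(ABC):
    """Stand-in base class mirroring the three abstract methods."""

    @abstractmethod
    def inputs(self) -> list[str]:
        """Field names the stage needs on each incoming task."""

    @abstractmethod
    def outputs(self) -> list[str]:
        """Field names present on each task the stage returns."""

    @abstractmethod
    def process(self, task: TextTask) -> TextTask:
        """Transform a single task."""


class WordCountStage(StageBase):
    def inputs(self) -> list[str]:
        return ["text"]

    def outputs(self) -> list[str]:
        return ["text", "word_count"]

    def process(self, task: TextTask) -> TextTask:
        # Check the declared inputs before doing any work.
        missing = [name for name in self.inputs() if name not in task.data]
        if missing:
            raise KeyError(f"missing input fields: {missing}")
        count = len(task.data["text"].split())
        # Return a new task instead of mutating the input.
        return TextTask(data={**task.data, "word_count": count})
```

Declaring inputs and outputs separately from `process()` lets a pipeline check that adjacent stages are compatible before any data is processed.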
Per-Stage Runtime Environments
Stages can declare isolated Python dependencies using Ray’s native runtime_env. Set runtime_env as a class variable to specify packages that should be installed in an isolated virtualenv for that stage’s workers:
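As a rough sketch of the class-variable pattern, the snippet below uses a hypothetical stand-in base class instead of importing NeMo Curator. The dictionary follows Ray's `runtime_env` format, where the `pip` field lists requirement specifiers. The package name and version are placeholders, and whether a given stage should use `pip` or another Ray field is an assumption.

```python
from typing import Optional


class StageSketch:
    """Hypothetical stand-in for a stage base class."""

    # No isolated environment unless a subclass declares one.
    runtime_env: Optional[dict] = None


class TokenizeStage(StageSketch):
    # Ray runtime_env dictionary. The pinned package is a placeholder
    # chosen for illustration, not a recommendation.
    runtime_env = {"pip": ["transformers==4.44.0"]}


def declared_packages(stage_cls: type) -> list[str]:
    """Return the pip requirements a stage class declares, if any."""
    env = getattr(stage_cls, "runtime_env", None) or {}
    return list(env.get("pip", []))
```

Because the value lives on the class, every instance of `TokenizeStage` shares the same declaration unless it is overridden per instance.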
You can also override runtime_env at instantiation time using with_().
All three execution backends (XennaExecutor, RayDataExecutor, RayActorPoolExecutor) support per-stage runtime environments. See the Per-Stage Runtime Environments reference for details.
Configuration with with_()
Stages can be configured using the with_() method:
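One way to picture this configuration style is a method that returns a modified copy of the stage. The sketch below is an assumption-laden illustration, not the library's implementation: `StageSketch`, its attributes, and the keyword-only `**overrides` signature are hypothetical, and this page does not state whether the real `with_()` copies the stage or changes it in place.

```python
import copy
from typing import Any, Optional


class StageSketch:
    """Hypothetical stand-in for a configurable stage."""

    name: str = "stage"
    batch_size: int = 1
    runtime_env: Optional[dict] = None

    def with_(self, **overrides: Any) -> "StageSketch":
        """Return a shallow copy with the given attributes replaced."""
        unknown = [key for key in overrides if not hasattr(self, key)]
        if unknown:
            raise AttributeError(f"unknown stage options: {unknown}")
        clone = copy.copy(self)
        for key, value in overrides.items():
            setattr(clone, key, value)
        return clone


# Usage: tune one instance without touching the original.
base = StageSketch()
tuned = base.with_(batch_size=32, runtime_env={"pip": ["numpy"]})
```

Returning a copy keeps the original stage reusable in other pipelines with its default settings.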