The ProcessingStage class is the base class for all data processing stages in NeMo Curator. Each stage defines a single step in a data curation pipeline.
inputs(): Define stage input requirements.
outputs(): Define stage output requirements.
process(): Process a single task.
setup_on_node(): Node-level initialization (e.g., download models).
setup(): Worker-level initialization (e.g., load models).
teardown(): Cleanup after processing.
process_batch(): Vectorized batch processing for better performance.
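To make the role of each method concrete, the following is a minimal sketch of how the hooks listed above might fit together. It deliberately does not import NeMo Curator: LowercaseStage and run_stage are hypothetical stand-ins, the dict-based task is a simplification, and the return shape of inputs() and outputs() and the call order are assumptions rather than the library's documented behavior.

```python
class LowercaseStage:
    """Hypothetical stand-in stage; it does not subclass the real ProcessingStage."""

    def __init__(self):
        self.calls = []        # records the order in which hooks run
        self.normalize = None  # populated in setup()

    def inputs(self):
        # Assumed shape: (required task attributes, required data columns).
        return ["data"], ["text"]

    def outputs(self):
        # Same assumed shape as inputs().
        return ["data"], ["text"]

    def setup_on_node(self):
        # Node-level work, e.g. downloading a model once per machine.
        self.calls.append("setup_on_node")

    def setup(self):
        # Worker-level work, e.g. loading the model into memory.
        self.calls.append("setup")
        self.normalize = str.lower

    def process(self, task):
        # Handle a single task (here: a dict holding a list of strings).
        self.calls.append("process")
        return {"text": [self.normalize(t) for t in task["text"]]}

    def process_batch(self, tasks):
        # Fallback: process one task at a time. A real stage would
        # override this to vectorize work across the whole batch.
        return [self.process(task) for task in tasks]

    def teardown(self):
        # Release whatever setup() acquired.
        self.calls.append("teardown")
        self.normalize = None


def run_stage(stage, tasks):
    """Toy driver (hypothetical) that calls the hooks in an assumed order."""
    stage.setup_on_node()
    stage.setup()
    try:
        return stage.process_batch(tasks)
    finally:
        stage.teardown()


stage = LowercaseStage()
results = run_stage(stage, [{"text": ["Hello", "WORLD"]}])
```

In an actual pipeline the executor, not user code, would be responsible for invoking these hooks; the driver here exists only to show the intended sequence.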
Stages can declare isolated Python dependencies using Ray’s native runtime_env. Set runtime_env as a class variable to specify packages that should be installed in an isolated virtualenv for that stage’s workers:
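A class-level declaration might look like the sketch below. TokenizeStage is a hypothetical stand-in that does not subclass the real ProcessingStage, and the pinned package is illustrative only; the dictionary itself follows Ray's runtime_env format, where the "pip" key lists requirement specifiers.

```python
from typing import Any, ClassVar


class TokenizeStage:
    """Hypothetical stand-in for a ProcessingStage subclass."""

    # Ray runtime_env dict: "pip" lists the requirement specifiers to
    # install for this stage's workers. The pin below is illustrative.
    runtime_env: ClassVar[dict[str, Any]] = {"pip": ["transformers==4.40.0"]}

    def process(self, task: list[str]) -> list[list[str]]:
        # Placeholder logic: whitespace tokenization of each text.
        return [text.split() for text in task]


tokens = TokenizeStage().process(["isolated dependencies per stage"])
```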
You can also override runtime_env at instantiation time using with_():
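The sketch below illustrates the idea of a per-instance override. The with_() defined here is a hypothetical stand-in, not the library's implementation: it accepts only runtime_env and returns a shallow copy, and whether the real method copies or mutates the stage is an assumption this text does not settle.

```python
import copy


class TokenizeStage:
    """Hypothetical stand-in for a ProcessingStage subclass."""

    # Class-level default environment (illustrative pin).
    runtime_env = {"pip": ["transformers==4.40.0"]}

    def with_(self, runtime_env=None):
        """Stand-in with_(): return a copy with the override applied."""
        new_stage = copy.copy(self)
        if runtime_env is not None:
            # The instance attribute shadows the class-level default.
            new_stage.runtime_env = runtime_env
        return new_stage


default_stage = TokenizeStage()
pinned_stage = default_stage.with_(runtime_env={"pip": ["transformers==4.41.0"]})
```

Because the override lives on the instance, other instances of the same stage class keep the class-level default.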
All three execution backends (XennaExecutor, RayDataExecutor, RayActorPoolExecutor) support per-stage runtime environments. See the Per-Stage Runtime Environments reference for details.
with_()
Stages can be configured using the with_() method:
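As a rough illustration of this fluent configuration style, the sketch below uses a frozen dataclass and dataclasses.replace. FilterStage, its name and batch_size fields, and the keyword arguments accepted by this with_() are all hypothetical; consult the ProcessingStage reference for the parameters the real method accepts.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FilterStage:
    """Hypothetical stand-in stage with two configurable fields."""

    name: str = "filter"
    batch_size: int = 1

    def with_(self, **overrides):
        # Stand-in with_(): build a new stage with the given fields
        # replaced, leaving the original instance untouched.
        return replace(self, **overrides)


base = FilterStage()
tuned = base.with_(name="short_text_filter", batch_size=16)
```

Returning a new object is a design choice of this sketch: it lets one stage definition be reused with different settings in several pipelines without the configurations interfering with each other.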