The ProcessingStage class is the base class for all data processing stages in NeMo Curator. Each stage defines a single step in a data curation pipeline.
inputs(): Define stage input requirements.
outputs(): Define stage output requirements.
process(): Process a single task.
setup_on_node(): Node-level initialization (e.g., download models).
setup(): Worker-level initialization (e.g., load models).
teardown(): Cleanup after processing.
process_batch(): Vectorized batch processing for better performance.
with_(): Stages can be configured using the with_() method.
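To make the lifecycle concrete, here is a minimal, self-contained sketch of how the methods listed above might fit together. It does not import NeMo Curator: `LowercaseStage`, the `run` driver, the `name` and `batch_size` fields, the return shapes of `inputs()` and `outputs()`, and the keyword arguments accepted by `with_()` are all hypothetical stand-ins chosen for illustration. The real `ProcessingStage` signatures may differ, so check the library's API reference before relying on any of them.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LowercaseStage:
    """Hypothetical stand-in for a ProcessingStage subclass (not the real class)."""

    name: str = "lowercase"   # assumed configuration field
    batch_size: int = 1       # assumed configuration field

    def inputs(self) -> list[str]:
        # Return shape is invented for this sketch.
        return ["text"]

    def outputs(self) -> list[str]:
        # Return shape is invented for this sketch.
        return ["text"]

    def setup_on_node(self) -> None:
        # Node-level initialization, e.g. downloading a model once per node.
        pass

    def setup(self) -> None:
        # Worker-level initialization, e.g. loading the model into memory.
        pass

    def process(self, task: dict) -> dict:
        # Process a single task: lowercase its "text" field.
        return {**task, "text": task["text"].lower()}

    def process_batch(self, tasks: list[dict]) -> list[dict]:
        # Fallback batch path that reuses process(); a real stage would
        # vectorize this step to get the performance benefit.
        return [self.process(task) for task in tasks]

    def teardown(self) -> None:
        # Cleanup after processing.
        pass

    def with_(self, **overrides) -> LowercaseStage:
        # Assumption: with_() returns a reconfigured copy and leaves the
        # original stage untouched.
        return replace(self, **overrides)


def run(stage: LowercaseStage, tasks: list[dict]) -> list[dict]:
    """Hypothetical driver showing one plausible call order."""
    stage.setup_on_node()
    stage.setup()
    try:
        results: list[dict] = []
        for start in range(0, len(tasks), stage.batch_size):
            results.extend(stage.process_batch(tasks[start:start + stage.batch_size]))
        return results
    finally:
        stage.teardown()


# Configure a copy of the stage, then run it over a few tasks.
configured = LowercaseStage().with_(name="lowercase_batched", batch_size=2)
results = run(configured, [{"text": "Hello"}, {"text": "WORLD"}, {"text": "Curator"}])
```

Returning a copy from `with_()` is a design choice of this sketch: it lets one stage definition be reused with different settings without one configuration leaking into another.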