nemo_curator.stages.text.modifiers.modifier
Module Contents
Classes
Functions
API
Dataclass
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Modify fields of dataset records.
You can provide:
- a DocumentModifier instance; its modify_document will be used
- a callable that takes a single value (single-input) or a dict[str, Any] (multi-input)
- a list mixing the above, applied in order
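To illustrate the two callable shapes listed above, here is a minimal sketch. It assumes that a modifier with one input field receives the bare value, while a modifier with several input fields receives a dict keyed by field name. The helper apply_modifier and the plain-dict record are hypothetical and are not part of nemo_curator.

```python
from typing import Any, Callable


def apply_modifier(
    record: dict[str, Any],
    fn: Callable[[Any], Any],
    input_fields: list[str],
) -> Any:
    """Hypothetical helper: call `fn` on one record (not the library's code)."""
    if len(input_fields) == 1:
        # Single-input: pass the bare field value.
        return fn(record[input_fields[0]])
    # Multi-input: pass a dict of field name -> value.
    return fn({name: record[name] for name in input_fields})


record = {"title": "Hello", "text": "  world  "}

# Single-input callable: receives a str.
stripped = apply_modifier(record, str.strip, ["text"])

# Multi-input callable: receives a dict holding both fields.
joined = apply_modifier(
    record,
    lambda row: f"{row['title']}: {row['text'].strip()}",
    ["title", "text"],
)
```

A multi-input callable returns a single value here, which is one way to read why an explicit output field is needed once a modifier takes more than one input.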
Input fields can be:
- str (single field reused for each modifier)
- list[str] (one field per modifier)
- list[list[str]] (per-modifier multiple input fields)
Implicit behavior:
- If output_fields is None and each modifier has exactly one input field, results are written in-place to that input field.
- If any modifier has multiple input fields, output_fields is required (provide a single name to reuse for all, or one per modifier).
Parameters:
modifier_fn
Modifier or list of modifiers to apply.
input_fields
Input field(s); see above for accepted forms.
output_fields
Output field name(s). If None and all inputs are single-column, in-place update is performed.
input_fields
modifier_fn
name
output_fields
Derive the stage name from the provided modifiers.
Normalize input fields into a list[list[str]] with one entry per modifier.
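The normalization described above could look roughly like the sketch below. The function name normalize_input_fields, its num_modifiers argument, and the length check are assumptions made for illustration; the library's actual method may differ.

```python
from typing import Union


def normalize_input_fields(
    input_fields: Union[str, list[str], list[list[str]]],
    num_modifiers: int,
) -> list[list[str]]:
    """Hypothetical sketch: return one list of field names per modifier."""
    if isinstance(input_fields, str):
        # A single name is reused for every modifier.
        return [[input_fields] for _ in range(num_modifiers)]

    # Wrap bare names; copy nested lists so the caller's lists are not shared.
    normalized = [
        [entry] if isinstance(entry, str) else list(entry)
        for entry in input_fields
    ]
    if len(normalized) != num_modifiers:
        # Assumed behavior: a list must supply one entry per modifier.
        raise ValueError(
            f"expected {num_modifiers} input field entries, got {len(normalized)}"
        )
    return normalized
```

With two modifiers, "text" becomes [["text"], ["text"]], ["a", "b"] becomes [["a"], ["b"]], and [["a", "b"], ["c"]] is returned unchanged.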
Resolve output column names to one per modifier.
Rules:
- None overall: in-place if all modifiers have exactly one input; else error.
- str overall: replicate for all modifiers.
- list overall (len 1 or len(modifiers)):
  - Each entry may be a str (explicit output) or None (implicit in-place; requires single input).
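The rules above can be sketched as a standalone function. The name resolve_output_fields is hypothetical, and treating a length-1 list as a value replicated across all modifiers is an assumption about what "len 1" means; the input is expected in the normalized list[list[str]] form.

```python
from typing import Optional, Union


def resolve_output_fields(
    output_fields: Union[None, str, list[Optional[str]]],
    input_fields: list[list[str]],
) -> list[str]:
    """Hypothetical sketch: resolve one output column name per modifier."""
    n = len(input_fields)
    if output_fields is None or isinstance(output_fields, str):
        # None or a single name applies to every modifier.
        entries = [output_fields] * n
    elif len(output_fields) == 1:
        # Assumption: a length-1 list is replicated for all modifiers.
        entries = list(output_fields) * n
    elif len(output_fields) == n:
        entries = list(output_fields)
    else:
        raise ValueError("output_fields must have length 1 or one entry per modifier")

    resolved = []
    for entry, inputs in zip(entries, input_fields):
        if entry is not None:
            # Explicit output name.
            resolved.append(entry)
        elif len(inputs) == 1:
            # Implicit in-place: write back to the single input field.
            resolved.append(inputs[0])
        else:
            raise ValueError("an output field is required for multi-input modifiers")
    return resolved
```

For example, resolve_output_fields(["out", None], [["a", "b"], ["c"]]) yields ["out", "c"]: the first modifier writes to an explicit column and the second updates c in place.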
Validate inputs and normalize the modifier(s) to a list.