nemo_curator.stages.text.modifiers.modifier

Module Contents

Classes

Name | Description
Modify | Modify fields of dataset records.

Functions

Name | Description
_get_modifier_stage_name | Derive the stage name from the provided modifiers.
_modifier_name | -
_normalize_input_fields | Normalize input fields into a list[list[str]] with one entry per modifier.
_normalize_output_fields | Resolve output column names to one per modifier.
_validate_and_normalize_modifiers | Validate inputs and normalize the modifier(s) to a list.

API

class nemo_curator.stages.text.modifiers.modifier.Modify(
modifier_fn: collections.abc.Callable | nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | list[nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable],
input_fields: str | list[str] | list[list[str]] = 'text',
output_fields: str | list[str | None] | None = None,
name: str = 'modifier_fn'
)
Dataclass

Bases: ProcessingStage[DocumentBatch, DocumentBatch]

Modify fields of dataset records.

You can provide:

  • a DocumentModifier instance; its modify_document will be used
  • a callable that takes a single value (single-input) or a dict[str, Any] (multi-input)
  • a list mixing the above, applied in order

Input fields can be:

  • str (single field reused for each modifier)
  • list[str] (one field per modifier)
  • list[list[str]] (per-modifier multiple input fields)

Implicit behavior:

  • If output_fields is None and each modifier has exactly one input field, results are written in place to that input field.
  • If any modifier has multiple input fields, output_fields is required (provide a single name to reuse for all modifiers, or one name per modifier).
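
The calling conventions above can be illustrated without the library. The following is a minimal sketch in plain Python, not NeMo Curator's implementation: apply_modifier is a hypothetical helper, and records are modeled as plain dicts rather than a DocumentBatch. It assumes that a single-input callable receives the field value and a multi-input callable receives a dict keyed by field name, as described above.

```python
from __future__ import annotations

from collections.abc import Callable
from typing import Any


def apply_modifier(
    records: list[dict[str, Any]],
    fn: Callable[[Any], Any],
    input_fields: list[str],
    output_field: str | None = None,
) -> list[dict[str, Any]]:
    """Hypothetical helper: apply one modifier to every record."""
    if output_field is None:
        if len(input_fields) != 1:
            # Mirrors the documented rule: multi-input needs an explicit output.
            raise ValueError("an output field is required for multi-input modifiers")
        output_field = input_fields[0]  # in-place update of the single input

    result = []
    for record in records:
        if len(input_fields) == 1:
            value = fn(record[input_fields[0]])  # callable sees a single value
        else:
            # Callable sees a dict of the selected fields.
            value = fn({name: record[name] for name in input_fields})
        result.append({**record, output_field: value})
    return result


records = [{"title": "Intro", "text": "  hello  "}]

# Single input, no output field: "text" is overwritten in the returned records.
stripped = apply_modifier(records, str.strip, ["text"])

# Multiple inputs: the callable gets a dict and the output field must be named.
joined = apply_modifier(
    stripped,
    lambda row: f"{row['title']}: {row['text']}",
    ["title", "text"],
    output_field="combined",
)
```

The sketch returns new dicts instead of mutating its input, which is a choice made for clarity here and says nothing about how the real stage handles its batch.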

Parameters:

modifier_fn
Callable | DocumentModifier | list[DocumentModifier | Callable]

Modifier or list of modifiers to apply.

input_fields
str | list[str] | list[list[str]]. Defaults to 'text'.

Input field(s); see above for accepted forms.

output_fields
str | list[str | None] | None. Defaults to None.

Output field name(s). If None and every modifier has a single input field, each result is written in place to that input field.

input_fields: str | list[str] | list[list[str]] = 'text'
modifier_fn: Callable | DocumentModifier | list[DocumentModifier | Callable]
name: str = 'modifier_fn'
output_fields: str | list[str | None] | None = None
nemo_curator.stages.text.modifiers.modifier.Modify.__post_init__()
nemo_curator.stages.text.modifiers.modifier.Modify.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.modifiers.modifier.Modify.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.modifiers.modifier.Modify.process(
batch: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.DocumentBatch | None
nemo_curator.stages.text.modifiers.modifier._get_modifier_stage_name(
modifiers: list[nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable]
) -> str

Derive the stage name from the provided modifiers.

nemo_curator.stages.text.modifiers.modifier._modifier_name(
x: nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable
) -> str
nemo_curator.stages.text.modifiers.modifier._normalize_input_fields(
input_fields: str | list[str] | list[list[str]],
modifiers: list[nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable]
) -> list[list[str]]

Normalize input fields into a list[list[str]] with one entry per modifier.
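
As a rough illustration of this normalization, the sketch below re-creates the three accepted forms of input_fields in plain Python. It is not the library's code: normalize_input_fields is a hypothetical stand-in that takes a modifier count instead of the modifier list, and the length check on list inputs is an assumption inferred from "one entry per modifier".

```python
from __future__ import annotations


def normalize_input_fields(
    input_fields: str | list[str] | list[list[str]],
    num_modifiers: int,
) -> list[list[str]]:
    """Hypothetical sketch: return one list of field names per modifier."""
    if isinstance(input_fields, str):
        # A single field name is reused for every modifier.
        return [[input_fields] for _ in range(num_modifiers)]

    if len(input_fields) != num_modifiers:
        # Assumption: a list must supply exactly one entry per modifier.
        raise ValueError(
            f"expected {num_modifiers} input field entries, got {len(input_fields)}"
        )

    # Wrap bare names so that every entry becomes a list of field names.
    return [
        [entry] if isinstance(entry, str) else list(entry)
        for entry in input_fields
    ]


same_field = normalize_input_fields("text", 2)
per_modifier = normalize_input_fields(["title", "body"], 2)
multi_input = normalize_input_fields([["title", "body"], ["summary"]], 2)
```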

nemo_curator.stages.text.modifiers.modifier._normalize_output_fields(
output_fields: str | list[str | None] | None,
input_fields: list[list[str]],
modifiers: list[nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable]
) -> list[str]

Resolve output column names to one per modifier.

Rules:

  • None overall: in-place if all modifiers have exactly one input; else error.
  • str overall: replicate for all modifiers.
  • list overall (len 1 or len(modifiers)):
    • Each entry may be a str (explicit output) or None (implicit in-place; requires single input).
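
The rules above can be sketched in plain Python as follows. This is an illustrative reading of the rules, not the library's implementation: normalize_output_fields is a hypothetical stand-in that infers the modifier count from input_fields, and broadcasting a length-1 list to all modifiers is an assumption about what "len 1" means.

```python
from __future__ import annotations


def normalize_output_fields(
    output_fields: str | list[str | None] | None,
    input_fields: list[list[str]],
) -> list[str]:
    """Hypothetical sketch: resolve one output column name per modifier."""
    count = len(input_fields)

    if output_fields is None:
        candidates: list[str | None] = [None] * count  # all implicit in-place
    elif isinstance(output_fields, str):
        candidates = [output_fields] * count  # replicate for all modifiers
    elif len(output_fields) == 1:
        candidates = list(output_fields) * count  # assumption: broadcast
    elif len(output_fields) == count:
        candidates = list(output_fields)
    else:
        raise ValueError("output_fields must have length 1 or one entry per modifier")

    resolved = []
    for output, inputs in zip(candidates, input_fields):
        if output is None:
            if len(inputs) != 1:
                # In-place output is only defined for a single input field.
                raise ValueError("an explicit output is required for multi-input modifiers")
            output = inputs[0]
        resolved.append(output)
    return resolved


in_place = normalize_output_fields(None, [["text"], ["text"]])
replicated = normalize_output_fields("clean", [["title", "body"], ["summary"]])
mixed = normalize_output_fields([None, "joined"], [["title"], ["body", "summary"]])
```
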
nemo_curator.stages.text.modifiers.modifier._validate_and_normalize_modifiers(
_modifier: nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable | list[nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable],
input_field: str | list[str] | list[list[str]] | None
) -> list[nemo_curator.stages.text.modifiers.doc_modifier.DocumentModifier | collections.abc.Callable]

Validate inputs and normalize the modifier(s) to a list.