nemo_curator.stages.text.io.writer.base

Module Contents

Classes

Name | Description
BaseWriter | Base class for all writer stages.

API

class nemo_curator.stages.text.io.writer.base.BaseWriter(
path: str,
file_extension: str,
write_kwargs: dict[str, typing.Any] = dict(),
fields: list[str] | None = None,
name: str = 'BaseWriter',
mode: typing.Literal['ignore', 'overwrite', 'append', 'error'] = 'ignore',
append_mode_implemented: bool = False
)
Dataclass, Abstract

Bases: ProcessingStage[DocumentBatch, FileGroupTask]

Base class for all writer stages.

This abstract base class provides common functionality for writing DocumentBatch tasks to files, including file naming, metadata handling, and filesystem operations.

append_mode_implemented: bool = False
fields: list[str] | None = None
file_extension: str
mode: Literal['ignore', 'overwrite', 'append', 'error'] = 'ignore'
name: str = 'BaseWriter'
path: str
write_kwargs: dict[str, Any] = field(default_factory=dict)
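
The `write_kwargs` default appears as `dict()` in the signature and as `field(default_factory=dict)` in the attribute list. For a Python dataclass, only the second form is valid: a mutable default such as `dict()` is rejected when the class is defined. The sketch below illustrates this with a hypothetical `WriterConfig` dataclass, which is an illustration only and not part of NeMo Curator.

```python
from dataclasses import dataclass, field
from typing import Any


# Hypothetical stand-in, not the real BaseWriter: just the two fields needed
# to illustrate how default_factory behaves.
@dataclass
class WriterConfig:
    path: str
    write_kwargs: dict[str, Any] = field(default_factory=dict)


a = WriterConfig(path="out_a")
b = WriterConfig(path="out_b")

# Each instance gets its own dict, so mutating one does not leak into the other.
a.write_kwargs["compression"] = "gzip"


def define_with_shared_default() -> str:
    """Try to declare a dataclass with a mutable default and report the outcome."""
    try:
        @dataclass
        class Broken:
            write_kwargs: dict[str, Any] = dict()
    except ValueError as exc:
        # dataclasses refuses mutable defaults and points to default_factory.
        return f"rejected: {exc}"
    return "accepted"
```

Because of this, omitting `write_kwargs` when constructing a writer should yield a fresh empty dict per instance rather than one shared across stages.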
nemo_curator.stages.text.io.writer.base.BaseWriter.__post_init__()
nemo_curator.stages.text.io.writer.base.BaseWriter.get_file_extension() -> str

Return the file extension for this writer format.

nemo_curator.stages.text.io.writer.base.BaseWriter.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.io.writer.base.BaseWriter.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.text.io.writer.base.BaseWriter.process(
task: nemo_curator.tasks.DocumentBatch
) -> nemo_curator.tasks.FileGroupTask

Process a DocumentBatch and write to files.

Parameters:

task (DocumentBatch): DocumentBatch containing data to write

Returns (FileGroupTask): Task containing paths to written files

nemo_curator.stages.text.io.writer.base.BaseWriter.write_data(
task: nemo_curator.tasks.DocumentBatch,
file_path: str
) -> None
abstract

Write data to file using format-specific implementation.
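
How a concrete writer plugs into this interface can be sketched without the library installed. The block below is a stand-in built only on the standard library. `SketchWriter`, `JsonlSketchWriter`, the list-of-dicts batch, the `task_id` argument, and the file-naming scheme are all assumptions made for illustration. The real `BaseWriter.process` may build paths, handle `mode`, and populate `FileGroupTask` differently. Only the overall shape is taken from the reference above: an abstract `write_data(task, file_path)` called from `process`, plus the `path`, `file_extension`, `fields`, and `write_kwargs` attributes.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Optional

# Stand-in for a DocumentBatch: a plain list of records (assumption).
Batch = list[dict[str, Any]]


@dataclass
class SketchWriter(ABC):
    """Hypothetical mirror of the documented interface, not the real class."""

    path: str
    file_extension: str
    write_kwargs: dict[str, Any] = field(default_factory=dict)
    fields: Optional[list[str]] = None

    def process(self, task: Batch, task_id: str = "batch_0") -> list[str]:
        """Write one batch and return the written paths (stand-in for FileGroupTask)."""
        os.makedirs(self.path, exist_ok=True)
        # Assumed naming scheme: <path>/<task_id>.<file_extension>
        file_path = os.path.join(self.path, f"{task_id}.{self.file_extension}")
        self.write_data(task, file_path)
        return [file_path]

    @abstractmethod
    def write_data(self, task: Batch, file_path: str) -> None:
        """Format-specific write, supplied by each subclass."""


@dataclass
class JsonlSketchWriter(SketchWriter):
    file_extension: str = "jsonl"

    def write_data(self, task: Batch, file_path: str) -> None:
        with open(file_path, "w", encoding="utf-8") as fh:
            for record in task:
                # Keep only the requested fields when `fields` is set.
                if self.fields is not None:
                    record = {k: record[k] for k in self.fields}
                # Extra keyword arguments are forwarded to the serializer.
                fh.write(json.dumps(record, **self.write_kwargs) + "\n")


def demo() -> list[str]:
    """Write one small batch to a temporary directory and read it back."""
    out_dir = tempfile.mkdtemp()
    writer = JsonlSketchWriter(path=out_dir, fields=["id", "text"])
    written = writer.process([{"id": 1, "text": "hello", "extra": "dropped"}])
    with open(written[0], encoding="utf-8") as fh:
        return fh.read().splitlines()
```

In this sketch the base class owns the path handling and each subclass supplies only the serialization step, which matches the split the reference describes between `process` and the abstract `write_data`.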