nemo_curator.stages.text.io.writer.parquet


Module Contents

Classes

Name | Description
ParquetWriter | Writer that writes a DocumentBatch to a Parquet file using pandas.

API

class nemo_curator.stages.text.io.writer.parquet.ParquetWriter(
path: str,
file_extension: str = 'parquet',
write_kwargs: dict[str, typing.Any] = dict(),
fields: list[str] | None = None,
name: str = 'parquet_writer',
mode: typing.Literal['ignore', 'overwrite', 'append', 'error'] = 'ignore',
append_mode_implemented: bool = False
)
Dataclass

Bases: BaseWriter

Writer that writes a DocumentBatch to a Parquet file using pandas.

file_extension: str = 'parquet'
name: str = 'parquet_writer'
write_kwargs: dict[str, Any] = field(default_factory=dict)

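The constructor signature renders the write_kwargs default as dict(), while the attribute listing shows field(default_factory=dict). The second form is how a dataclass normally declares a mutable default, so the two are presumably the same default rendered differently. The sketch below illustrates that behavior with a hypothetical stand-in class, WriterSketch, which is not the real ParquetWriter.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WriterSketch:
    """Hypothetical stand-in for illustration; not the real ParquetWriter."""

    path: str
    # default_factory builds a fresh dict for every instance.
    write_kwargs: dict[str, Any] = field(default_factory=dict)


a = WriterSketch(path="out_a")
b = WriterSketch(path="out_b")
a.write_kwargs["compression"] = "snappy"

print(a.write_kwargs)  # {'compression': 'snappy'}
print(b.write_kwargs)  # {}

# A literal dict() default is rejected by the dataclass decorator itself.
raised = False
try:

    @dataclass
    class BrokenSketch:
        write_kwargs: dict[str, Any] = dict()

except ValueError as exc:
    raised = True
    print(f"dataclass refused the mutable default: {exc}")
```

Because each instance gets its own dictionary, options set on one writer do not leak into another.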
nemo_curator.stages.text.io.writer.parquet.ParquetWriter.write_data(
task: nemo_curator.tasks.DocumentBatch,
file_path: str
) -> None

Write the task's data to a Parquet file at file_path using pandas DataFrame.to_parquet.
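A minimal sketch of the pattern this description suggests, not the library's implementation. It assumes the batch's data is available as a pandas DataFrame and that write_kwargs is forwarded to DataFrame.to_parquet as keyword arguments. The names SupportsToParquet, write_frame_sketch, and demo are hypothetical.

```python
from __future__ import annotations

from typing import Any, Protocol


class SupportsToParquet(Protocol):
    """Anything with a pandas-style to_parquet method, e.g. pandas.DataFrame."""

    def to_parquet(self, path: str, **kwargs: Any) -> None: ...


def write_frame_sketch(
    df: SupportsToParquet,
    file_path: str,
    write_kwargs: dict[str, Any] | None = None,
) -> None:
    """Hypothetical stand-in for write_data: forward extra options to to_parquet."""
    # Copy the options so the caller's dict is never mutated.
    kwargs = dict(write_kwargs or {})
    df.to_parquet(file_path, **kwargs)


def demo() -> None:
    """Usage with a real DataFrame; needs pandas plus pyarrow or fastparquet."""
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "text": ["first doc", "second doc"]})
    # compression and index are documented DataFrame.to_parquet parameters.
    write_frame_sketch(df, "batch.parquet", {"compression": "snappy", "index": False})
```

Calling demo() writes a two-row file named batch.parquet in the working directory. With the real class, the equivalent options would presumably be passed once through the write_kwargs constructor argument.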