nemo_curator.stages.text.io.reader.parquet

Module Contents

Classes

Name                Description
ParquetReader       Composite stage for reading Parquet files.
ParquetReaderStage  Stage that processes a group of Parquet files into a DocumentBatch.

API

class nemo_curator.stages.text.io.reader.parquet.ParquetReader(
file_paths: str | list[str],
files_per_partition: int | None = None,
blocksize: int | str | None = None,
fields: list[str] | None = None,
read_kwargs: dict[str, typing.Any] | None = None,
file_extensions: list[str] = (lambda: FILETYPE_TO_DEFAUL...,
task_type: typing.Literal['document', 'image', 'video', 'audio'] = 'document',
_generate_ids: bool = False,
_assign_ids: bool = False,
name: str = 'parquet_reader'
)
Dataclass

Bases: CompositeStage[_EmptyTask, DocumentBatch]

Composite stage for reading Parquet files.

This high-level stage decomposes into:

  1. FilePartitioningStage - partitions files into groups
  2. ParquetReaderStage - reads file groups into DocumentBatches
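
The first step of the decomposition above can be pictured with a small standalone sketch. It is not the library's `FilePartitioningStage`: the helper name `partition_files` is hypothetical, and it only illustrates the general idea of splitting a file list into fixed-size groups driven by a `files_per_partition`-style setting. The real stage may also take `blocksize` into account and may order or filter files differently.

```python
from __future__ import annotations


def partition_files(file_paths: list[str], files_per_partition: int = 1) -> list[list[str]]:
    """Split file_paths into consecutive groups of at most files_per_partition files.

    Hypothetical helper for illustration only; not part of nemo_curator.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be a positive integer")
    return [
        file_paths[start : start + files_per_partition]
        for start in range(0, len(file_paths), files_per_partition)
    ]


# Three files with two files per group yield one full group and one partial group.
groups = partition_files(
    ["a.parquet", "b.parquet", "c.parquet"],
    files_per_partition=2,
)
print(groups)
```

In this picture, each inner list plays the role of one file group that the second step would read into a single DocumentBatch.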

_assign_ids: bool = False
_generate_ids: bool = False
blocksize: int | str | None = None
fields: list[str] | None = None
file_extensions: list[str]
file_paths: str | list[str]
files_per_partition: int | None = None
name: str = 'parquet_reader'
read_kwargs: dict[str, Any] | None = None
task_type: Literal['document', 'image', 'video', 'audio'] = 'document'

nemo_curator.stages.text.io.reader.parquet.ParquetReader.__post_init__()

Initialize parent class after dataclass initialization.

nemo_curator.stages.text.io.reader.parquet.ParquetReader.decompose() -> list[nemo_curator.stages.text.io.reader.parquet.ParquetReaderStage]

Decompose into file partitioning and processing stages.

nemo_curator.stages.text.io.reader.parquet.ParquetReader.get_description() -> str

Get a description of this composite stage.
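
A minimal usage sketch for the composite stage, built only from the signature and methods documented above. The input directory and the column names are hypothetical, and the import is guarded so the snippet degrades to a no-op when NeMo Curator is not installed. How the resulting stage is wired into a pipeline is outside the scope of this page and is not shown.

```python
# Constructor arguments follow the documented signature; the values are illustrative.
READER_KWARGS = {
    "file_paths": "data/parquet/",  # hypothetical input directory
    "files_per_partition": 4,       # group up to four files per partition
    "fields": ["id", "text"],       # hypothetical column names to read
}

try:
    from nemo_curator.stages.text.io.reader.parquet import ParquetReader
except ImportError:
    ParquetReader = None  # NeMo Curator is not installed; skip the demo below

if ParquetReader is not None:
    reader = ParquetReader(**READER_KWARGS)
    # Human-readable summary of the composite stage.
    print(reader.get_description())
    # Inspect the lower-level stages the composite expands into.
    for stage in reader.decompose():
        print(type(stage).__name__)
```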

class nemo_curator.stages.text.io.reader.parquet.ParquetReaderStage(
fields: list[str] | None = None,
read_kwargs: dict[str, typing.Any] = dict(),
name: str = 'parquet_reader',
_generate_ids: bool = False,
_assign_ids: bool = False
)
Dataclass

Bases: BaseReader

Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Parameters:

fields: list[str] | None = None

If specified, only read these columns.

read_kwargs: dict[str, Any] = dict()

Keyword arguments for the underlying reader.

name: str = 'parquet_reader'

nemo_curator.stages.text.io.reader.parquet.ParquetReaderStage.read_data(
paths: list[str],
read_kwargs: dict[str, typing.Any] | None = None,
fields: list[str] | None = None
) -> pandas.DataFrame

Read Parquet files using Pandas. Raises an exception if reading fails.
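
The behavior described for `read_data` can be approximated with plain pandas. The sketch below is an assumption about the general approach, not the library's implementation: the function name `read_parquet_group` is hypothetical, `fields` is mapped onto the `columns` argument of `pandas.read_parquet`, and any error raised by pandas simply propagates to the caller.

```python
from __future__ import annotations

from typing import Any

import pandas as pd


def read_parquet_group(
    paths: list[str],
    read_kwargs: dict[str, Any] | None = None,
    fields: list[str] | None = None,
) -> pd.DataFrame:
    """Read several Parquet files into one DataFrame (illustrative sketch only)."""
    kwargs = dict(read_kwargs or {})  # copy so the caller's dict is not mutated
    if fields is not None:
        kwargs["columns"] = fields  # restrict the read to the requested columns
    frames = [pd.read_parquet(path, **kwargs) for path in paths]
    # Concatenate the per-file frames; failures from pandas are not swallowed.
    return pd.concat(frames, ignore_index=True)


# A missing file surfaces as an exception rather than an empty result.
try:
    read_parquet_group(["does_not_exist.parquet"], fields=["text"])
except Exception as exc:
    print(f"read failed: {type(exc).__name__}")
```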