stages.text.io.reader.parquet

Module Contents

Classes

ParquetReader

Composite stage for reading Parquet files.

ParquetReaderStage

Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

API

class stages.text.io.reader.parquet.ParquetReader

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]

Composite stage for reading Parquet files.

This high-level stage decomposes into:

  1. FilePartitioningStage - partitions files into groups

  2. ParquetReaderStage - reads file groups into DocumentBatches
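The two-step decomposition above can be pictured with a small, library-independent sketch. It is not the NeMo Curator implementation: `partition_files` and `read_group` are hypothetical helpers, and the assumption that groups are consecutive slices of at most `files_per_partition` paths is made only for illustration.

```python
from collections.abc import Iterator


def partition_files(paths: list[str], files_per_partition: int) -> Iterator[list[str]]:
    """Hypothetical partitioning step: yield consecutive groups of paths.

    Each group holds at most `files_per_partition` paths and keeps input order.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    for start in range(0, len(paths), files_per_partition):
        yield paths[start : start + files_per_partition]


def read_group(group: list[str]) -> dict[str, object]:
    """Hypothetical reader step: turn one file group into one batch.

    A real reader would load the file contents; this stand-in only records
    which files would end up in the batch.
    """
    return {"num_files": len(group), "files": group}


# Step 1 partitions the file list; step 2 produces one batch per group.
paths = [f"data/part-{i:03d}.parquet" for i in range(5)]
batches = [read_group(group) for group in partition_files(paths, files_per_partition=2)]
# 5 files in groups of 2 give 3 batches: two with 2 files and one with 1.
```

Keeping partitioning separate from reading means each group can be handed to a different worker, which is presumably why the composite stage splits the work this way.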

Initialization

blocksize: int | str | None

None

decompose() -> list[stages.text.io.reader.parquet.ParquetReaderStage]

Decompose into file partitioning and processing stages.

fields: list[str] | None

None

file_extensions: list[str]

‘field(…)’

file_paths: str | list[str]

None

files_per_partition: int | None

None

get_description() -> str

Get a description of this composite stage.

read_kwargs: dict[str, Any] | None

None

task_type: Literal["document", "image", "video", "audio"]

‘document’

class stages.text.io.reader.parquet.ParquetReaderStage

Bases: stages.text.io.reader.base.BaseReader

Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Args:

  fields (list[str], optional): If specified, only read these columns. Defaults to None.

  read_kwargs (dict[str, Any], optional): Keyword arguments for the underlying reader. Defaults to {}.

read_data(
paths: list[str],
read_kwargs: dict[str, Any] | None = None,
fields: list[str] | None = None,
) -> pandas.DataFrame

Read the given Parquet files with pandas and return them as a single DataFrame. Raises an exception if reading fails.