`stages.text.io.reader.parquet`

Module Contents

Classes

`ParquetReader`	Composite stage for reading Parquet files.
`ParquetReaderStage`	Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

API

class stages.text.io.reader.parquet.ParquetReader

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]

Composite stage for reading Parquet files.

This high-level stage decomposes into:

FilePartitioningStage - partitions files into groups
ParquetReaderStage - reads file groups into DocumentBatches
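
The first step of this decomposition can be pictured with a small, self-contained sketch. It does not use NeMo Curator: `partition_files` is a hypothetical helper that assumes grouping by a fixed file count, in the spirit of the `files_per_partition` attribute. The actual FilePartitioningStage may group differently (for example by `blocksize`).

```python
from collections.abc import Iterator


def partition_files(file_paths: list[str], files_per_partition: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `files_per_partition` paths.

    Hypothetical helper for illustration only; not part of the library.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    for start in range(0, len(file_paths), files_per_partition):
        yield file_paths[start:start + files_per_partition]


# Five illustrative file names split into groups of two.
paths = [f"data/part-{i}.parquet" for i in range(5)]
groups = list(partition_files(paths, files_per_partition=2))
```

Each resulting group plays the role of one file group that a reader stage would then turn into a single batch.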

Initialization

blocksize: int | str | None = None

decompose() → list[stages.text.io.reader.parquet.ParquetReaderStage]
Decompose into file partitioning and processing stages.

fields: list[str] | None = None

file_extensions: list[str] = field(…)

file_paths: str | list[str] = None

files_per_partition: int | None = None

get_description() → str
Get a description of this composite stage.

name: str = 'parquet_reader'

read_kwargs: dict[str, Any] | None = None

task_type: Literal['document', 'image', 'video', 'audio'] = 'document'

class stages.text.io.reader.parquet.ParquetReaderStage

Bases: stages.text.io.reader.base.BaseReader

Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Args:

fields (list[str], optional): If specified, only read these columns. Defaults to None.
read_kwargs (dict[str, Any], optional): Keyword arguments for the underlying reader. Defaults to {}.

name: str = 'parquet_reader'

read_data(paths: list[str], read_kwargs: dict[str, Any] | None = None, fields: list[str] | None = None) → pandas.DataFrame
Read Parquet files using pandas. Raises an exception if reading fails.
