stages.text.io.reader.parquet
Module Contents
Classes
| ParquetReader | Composite stage for reading Parquet files. |
| ParquetReaderStage | Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches. |
API
- class stages.text.io.reader.parquet.ParquetReader
Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]
Composite stage for reading Parquet files.
This high-level stage decomposes into:
  FilePartitioningStage: partitions files into groups
  ParquetReaderStage: reads file groups into DocumentBatches
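The partitioning step can be pictured with a minimal, standard-library-only sketch. The helper `partition_files` and the sample file names are hypothetical; this illustrates the idea of grouping files before reading them and is not how FilePartitioningStage is actually implemented.

```python
from itertools import islice
from typing import Iterator


def partition_files(paths: list[str], files_per_partition: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `files_per_partition` paths.

    Hypothetical helper for illustration only.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    remaining = iter(sorted(paths))  # sort so the grouping is deterministic
    while group := list(islice(remaining, files_per_partition)):
        yield group


# Hypothetical file names; each group stands in for one FileGroupTask
# that the reading step would turn into a DocumentBatch.
groups = list(partition_files(["c.parquet", "a.parquet", "b.parquet"], files_per_partition=2))
for group in groups:
    print(group)
```

Grouping by file count is only one option; the `blocksize` attribute suggests that size-based grouping is also supported, which this sketch does not model.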
Initialization
- blocksize: int | str | None = None
- decompose() → list[stages.text.io.reader.parquet.ParquetReaderStage]
Decompose into file partitioning and processing stages.
- fields: list[str] | None = None
- file_extensions: list[str] = field(…)
- file_paths: str | list[str] = None
- files_per_partition: int | None = None
- get_description() → str
Get a description of this composite stage.
- read_kwargs: dict[str, Any] | None = None
- task_type: Literal['document', 'image', 'video', 'audio'] = 'document'
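A possible way to configure the composite stage, as a sketch under stated assumptions: the import path `nemo_curator.stages.text.io.reader.parquet` is inferred from the base-class path shown above, the class is assumed to accept its documented attributes as keyword arguments, and the directory and column names are hypothetical. The import is deferred into the function so the snippet can be loaded without the library installed.

```python
from typing import Any

# Hypothetical settings; the keys mirror the attributes documented above.
reader_config: dict[str, Any] = {
    "file_paths": "data/parquet/",   # hypothetical input directory
    "files_per_partition": 4,        # group four files into each task
    "fields": ["id", "text"],        # hypothetical column names to read
}


def build_reader(config: dict[str, Any]):
    """Construct a ParquetReader from keyword settings (assumed constructor shape)."""
    # Assumed import path; adjust it if your installation exposes the class elsewhere.
    from nemo_curator.stages.text.io.reader.parquet import ParquetReader

    return ParquetReader(**config)
```

Calling `build_reader(reader_config)` should then yield a stage whose `decompose()` method returns the lower-level stages described above.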
- class stages.text.io.reader.parquet.ParquetReaderStage
Bases: stages.text.io.reader.base.BaseReader
Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.
Args:
  fields (list[str], optional): If specified, only read these columns. Defaults to None.
  read_kwargs (dict[str, Any], optional): Keyword arguments for the underlying reader. Defaults to {}.
- read_data(paths: list[str], read_kwargs: dict[str, Any] | None = None, fields: list[str] | None = None)
Read Parquet files using Pandas. Raises an exception if reading fails.
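Reading a group of files with pandas might look roughly like the sketch below. The helpers `build_read_kwargs` and `read_parquet_group` are hypothetical names, and mapping `fields` onto the `columns` argument of `pandas.read_parquet` is an assumption about how column selection could be done, not a description of the library's internals. Errors are deliberately left uncaught so that a failed read raises.

```python
from typing import Any, Optional


def build_read_kwargs(read_kwargs: Optional[dict[str, Any]], fields: Optional[list[str]]) -> dict[str, Any]:
    """Merge the optional column selection into the reader keyword arguments."""
    kwargs = dict(read_kwargs or {})  # copy so the caller's dict is not mutated
    if fields is not None:
        kwargs["columns"] = fields  # pandas.read_parquet selects columns via `columns`
    return kwargs


def read_parquet_group(paths: list[str],
                       read_kwargs: Optional[dict[str, Any]] = None,
                       fields: Optional[list[str]] = None):
    """Read every file in the group and concatenate the rows into one DataFrame."""
    # Imported lazily; requires pandas plus a Parquet engine such as pyarrow.
    import pandas as pd

    if not paths:
        raise ValueError("paths must contain at least one file")
    kwargs = build_read_kwargs(read_kwargs, fields)
    # No try/except here: any read error propagates to the caller.
    frames = [pd.read_parquet(path, **kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Keeping the keyword merging in a separate pure function makes that part easy to check without touching the filesystem.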