nemo_curator.stages.text.io.reader.parquet
Module Contents
Classes
API
Dataclass
Bases: CompositeStage[_EmptyTask, DocumentBatch]
Composite stage for reading Parquet files.
This high-level stage decomposes into:
- FilePartitioningStage - partitions files into groups
- ParquetReaderStage - reads file groups into DocumentBatches
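The partitioning step can be illustrated with a minimal, library-free sketch. The helper `partition_files` is hypothetical and is not part of nemo_curator. It assumes that `files_per_partition` simply caps the number of files per group and that paths are sorted first; the real FilePartitioningStage may order or size groups differently (for example when `blocksize` is set).

```python
from typing import List, Sequence


def partition_files(file_paths: Sequence[str], files_per_partition: int) -> List[List[str]]:
    """Split file paths into consecutive groups of at most files_per_partition.

    Hypothetical helper for illustration only; not the nemo_curator implementation.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    # Sorting gives a deterministic grouping (an assumption of this sketch).
    paths = sorted(file_paths)
    return [
        paths[i:i + files_per_partition]
        for i in range(0, len(paths), files_per_partition)
    ]


groups = partition_files(
    ["data/c.parquet", "data/a.parquet", "data/b.parquet"],
    files_per_partition=2,
)
for group in groups:
    print(group)
```

In the composite stage, each such group would then be handed to ParquetReaderStage, which turns it into one DocumentBatch.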
_assign_ids
_generate_ids
blocksize
fields
file_extensions
file_paths
files_per_partition
name
read_kwargs
task_type
Initialize parent class after dataclass initialization.
Decompose into file partitioning and processing stages.
Get a description of this composite stage.
Dataclass
Bases: BaseReader
Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.
Parameters:
- fields - If specified, only read these columns. Defaults to None.
- read_kwargs - Keyword arguments for the underlying reader. Defaults to {}.
name
Read Parquet files using pandas. Raises an exception if reading fails.
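The reading step can be approximated with plain pandas. The sketch below is not the stage's actual code: `build_read_kwargs` and `read_file_group` are hypothetical helpers, and it is an assumption that `fields` maps to the `columns` argument of `pandas.read_parquet` and that `read_kwargs` is forwarded unchanged.

```python
from typing import Any, Dict, List, Optional


def build_read_kwargs(
    fields: Optional[List[str]] = None,
    read_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge a column selection with user-supplied reader options (hypothetical helper)."""
    kwargs = dict(read_kwargs or {})  # copy so the caller's dict is not mutated
    if fields is not None:
        # Assumption: `fields` corresponds to pandas' `columns` argument.
        kwargs["columns"] = list(fields)
    return kwargs


def read_file_group(
    paths: List[str],
    fields: Optional[List[str]] = None,
    read_kwargs: Optional[Dict[str, Any]] = None,
):
    """Read every file in a group and concatenate the rows into one DataFrame."""
    import pandas as pd  # imported lazily; also needs a Parquet engine such as pyarrow

    kwargs = build_read_kwargs(fields, read_kwargs)
    # No try/except here: any read error propagates to the caller.
    frames = [pd.read_parquet(path, **kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)


# Inspect the options that would be passed to pandas.read_parquet.
print(build_read_kwargs(["text", "id"], {"engine": "pyarrow"}))
```

Letting exceptions propagate mirrors the documented behavior that the stage raises if reading fails, rather than silently skipping a bad file.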