nemo_curator.stages.text.io.reader.parquet

Module Contents

Classes

Name	Description
`ParquetReader`	Composite stage for reading Parquet files.
`ParquetReaderStage`	Stage that processes a group of Parquet files into a DocumentBatch.

API

class nemo_curator.stages.text.io.reader.parquet.ParquetReader(
    file_paths: str | list[str],
    files_per_partition: int | None = None,
    blocksize: int | str | None = None,
    fields: list[str] | None = None,
    read_kwargs: dict[str, typing.Any] | None = None,
    file_extensions: list[str] = (lambda: FILETYPE_TO_DEFAUL...,
    task_type: typing.Literal['document', 'image', 'video', 'audio'] = 'document',
    _generate_ids: bool = False,
    _assign_ids: bool = False,
    name: str = 'parquet_reader'
)

Dataclass

Bases: CompositeStage[_EmptyTask, DocumentBatch]

Composite stage for reading Parquet files.

This high-level stage decomposes into:

FilePartitioningStage - partitions files into groups
ParquetReaderStage - reads file groups into DocumentBatches
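
The first step of this decomposition can be illustrated with a small standalone sketch. This is not NeMo Curator's implementation: `partition_files` is a hypothetical helper that only mimics grouping a list of paths by `files_per_partition`, and it ignores `blocksize` entirely. Treating `files_per_partition=None` as one file per group is an assumption made for the example.

```python
from __future__ import annotations

from collections.abc import Iterator


def partition_files(
    file_paths: list[str],
    files_per_partition: int | None = None,
) -> Iterator[list[str]]:
    """Yield consecutive groups of paths (hypothetical helper, not library code)."""
    # Assumption: with no explicit group size, each file forms its own group.
    size = 1 if files_per_partition is None else files_per_partition
    if size < 1:
        raise ValueError("files_per_partition must be a positive integer")
    # Sort so the grouping is deterministic regardless of input order.
    ordered = sorted(file_paths)
    for start in range(0, len(ordered), size):
        yield ordered[start:start + size]


groups = list(partition_files(["c.parquet", "a.parquet", "b.parquet"], files_per_partition=2))
print(groups)  # [['a.parquet', 'b.parquet'], ['c.parquet']]
```

Each yielded group plays the role of one file group handed to the reading stage, which turns it into a single DocumentBatch.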

_assign_ids

bool = False

_generate_ids

bool = False

blocksize

int | str | None = None

fields

list[str] | None = None

file_extensions

list[str]

file_paths

str | list[str]

files_per_partition

int | None = None

name

str = 'parquet_reader'

read_kwargs

dict[str, Any] | None = None

task_type

Literal['document', 'image', 'video', 'audio'] = 'document'

nemo_curator.stages.text.io.reader.parquet.ParquetReader.__post_init__()

Initialize parent class after dataclass initialization.

nemo_curator.stages.text.io.reader.parquet.ParquetReader.decompose() -> list[nemo_curator.stages.text.io.reader.parquet.ParquetReaderStage]

Decompose into file partitioning and processing stages.

nemo_curator.stages.text.io.reader.parquet.ParquetReader.get_description() -> str

Get a description of this composite stage.
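
A minimal usage sketch, assuming NeMo Curator is installed, might construct the composite stage and inspect it with the two methods documented above. The path, the `files_per_partition` value, and the `"text"` column name are placeholders, and `build_reader` and `describe` are hypothetical helper names. The printed output depends on the installed version, so none is shown.

```python
from __future__ import annotations


def build_reader(file_paths: str | list[str]):
    """Construct a ParquetReader using a few of the documented parameters."""
    # Imported lazily so this sketch can be loaded without NeMo Curator installed.
    from nemo_curator.stages.text.io.reader.parquet import ParquetReader

    return ParquetReader(
        file_paths=file_paths,    # a path string or a list of path strings
        files_per_partition=4,    # illustrative value
        fields=["text"],          # assumed column name; adjust to your schema
    )


def describe(reader) -> None:
    """Print the composite stage's description and the stages it decomposes into."""
    print(reader.get_description())
    for stage in reader.decompose():
        print(type(stage).__name__)


if __name__ == "__main__":
    # Placeholder path; replace with a real location before running.
    describe(build_reader("/path/to/parquet_files"))
```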

class nemo_curator.stages.text.io.reader.parquet.ParquetReaderStage(
    fields: list[str] | None = None,
    read_kwargs: dict[str, typing.Any] = dict(),
    name: str = 'parquet_reader',
    _generate_ids: bool = False,
    _assign_ids: bool = False
)

Dataclass

Bases: BaseReader

Stage that processes a group of Parquet files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Parameters:

fields

list[str] | None = None

If specified, read only these columns.

read_kwargs

dict[str, Any] = dict()

Keyword arguments passed to the underlying reader.

name

str = 'parquet_reader'

nemo_curator.stages.text.io.reader.parquet.ParquetReaderStage.read_data(
    paths: list[str],
    read_kwargs: dict[str, typing.Any] | None = None,
    fields: list[str] | None = None
) -> pandas.DataFrame

Read Parquet files using pandas. Raises an exception if reading fails.
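
As a rough sketch of what a method with this signature could do, the example below reads each path with `pandas.read_parquet` and concatenates the results. It is an illustration under stated assumptions, not the library's actual implementation: `build_read_kwargs` and `read_parquet_group` are hypothetical names, and mapping `fields` onto the `columns` argument of `pandas.read_parquet` is an assumption.

```python
from __future__ import annotations

from typing import Any


def build_read_kwargs(
    read_kwargs: dict[str, Any] | None = None,
    fields: list[str] | None = None,
) -> dict[str, Any]:
    """Merge the optional arguments into one kwargs dict (hypothetical helper)."""
    # Copy so the caller's dict is never mutated.
    kwargs = dict(read_kwargs or {})
    # Assumption: `fields` selects columns via the `columns` argument.
    if fields is not None:
        kwargs["columns"] = fields
    return kwargs


def read_parquet_group(
    paths: list[str],
    read_kwargs: dict[str, Any] | None = None,
    fields: list[str] | None = None,
):
    """Read every file in the group and return one combined DataFrame."""
    # Imported lazily so the helper above can be used without pandas installed.
    import pandas as pd

    kwargs = build_read_kwargs(read_kwargs, fields)
    # Any read error propagates to the caller instead of being swallowed.
    frames = [pd.read_parquet(path, **kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)


print(build_read_kwargs({"engine": "pyarrow"}, ["text"]))
# {'engine': 'pyarrow', 'columns': ['text']}
```

Letting exceptions propagate matches the documented behavior that a failed read raises rather than returning a partial result.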