nemo_curator.stages.text.io.reader.jsonl

Module Contents

Classes

Name	Description
`JsonlReader`	Composite stage for reading JSONL files.
`JsonlReaderStage`	Stage that processes a group of JSONL files into a DocumentBatch.

API

class nemo_curator.stages.text.io.reader.jsonl.JsonlReader(
    file_paths: str | list[str],
    files_per_partition: int | None = None,
    blocksize: int | str | None = None,
    fields: list[str] | None = None,
    read_kwargs: dict[str, typing.Any] | None = None,
    task_type: typing.Literal['document', 'image', 'video', 'audio'] = 'document',
    file_extensions: list[str] = (lambda: FILETYPE_TO_DEFAUL...,
    _generate_ids: bool = False,
    _assign_ids: bool = False,
    name: str = 'jsonl_reader'
)

Dataclass

Bases: CompositeStage[_EmptyTask, DocumentBatch]

Composite stage for reading JSONL files.

This high-level stage decomposes into:

FilePartitioningStage - partitions files into groups
JsonlReaderStage - reads file groups into DocumentBatches
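
The two-step decomposition above can be illustrated with a small standard-library sketch. It is a conceptual model only, not the NeMo Curator implementation: `partition_files`, `read_group`, and `demo` are hypothetical helpers, sorting the paths is an assumption made for deterministic output, and plain lists of dicts stand in for DocumentBatch.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def partition_files(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Split paths into consecutive groups of at most `files_per_partition` files.

    Hypothetical helper; sorting is an assumption, used here for determinism.
    """
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    ordered = sorted(paths)
    return [
        ordered[i:i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]


def read_group(group: list[str]) -> list[dict[str, Any]]:
    """Read every JSONL file in one group into a single list of records."""
    records: list[dict[str, Any]] = []
    for path in group:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records


def demo() -> list[int]:
    """Write five one-line JSONL files, group them in pairs, count records per group."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(5):
            path = Path(tmp) / f"part_{i}.jsonl"
            path.write_text(json.dumps({"id": i, "text": f"doc {i}"}) + "\n", encoding="utf-8")
            paths.append(str(path))
        groups = partition_files(paths, files_per_partition=2)
        return [len(read_group(group)) for group in groups]
```

In this sketch, five files with `files_per_partition=2` yield three groups, so the first stage controls how much data each downstream read handles at once.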

_assign_ids

bool = False

_generate_ids

bool = False

blocksize

int | str | None = None

fields

list[str] | None = None

file_extensions

list[str]

file_paths

str | list[str]

files_per_partition

int | None = None

name

str = 'jsonl_reader'

read_kwargs

dict[str, Any] | None = None

task_type

Literal['document', 'image', 'video', 'audio'] = 'document'

nemo_curator.stages.text.io.reader.jsonl.JsonlReader.__post_init__()

Initialize parent class after dataclass initialization.

nemo_curator.stages.text.io.reader.jsonl.JsonlReader.decompose() -> list[nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage]

Decompose into file partitioning and processing stages.

nemo_curator.stages.text.io.reader.jsonl.JsonlReader.get_description() -> str

Get a description of this composite stage.

class nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage(
    fields: list[str] | None = None,
    read_kwargs: dict[str, typing.Any] = dict(),
    name: str = 'jsonl_reader',
    _generate_ids: bool = False,
    _assign_ids: bool = False
)

Dataclass

Bases: BaseReader

Stage that processes a group of JSONL files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Parameters:

fields

list[str] | None = None

If specified, only read these fields (columns). Defaults to None.

read_kwargs

dict[str, Any] = dict()

Keyword arguments for the reader. Defaults to {}.

_generate_ids

bool = False

Whether to generate monotonically increasing IDs across all files. This uses the IdGenerator actor, which must be instantiated before this stage is used. ID generation can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.

_assign_ids

bool = False

Whether to assign monotonically increasing IDs from an IdGenerator. This uses the IdGenerator actor, which must be instantiated before this stage is used. ID assignment can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.

name

str = 'jsonl_reader'
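
To show what "monotonically increasing IDs across all files" implies, here is a minimal sketch of a shared counter that reserves a contiguous ID range for each batch. `RangeIdAllocator` is a hypothetical class written for this illustration. It is not the IdGenerator actor and makes no claim about that actor's interface.

```python
import threading


class RangeIdAllocator:
    """Hypothetical stand-in for a central ID service (not the IdGenerator actor)."""

    def __init__(self, start: int = 0) -> None:
        self._next_id = start
        self._lock = threading.Lock()

    def reserve(self, count: int) -> range:
        """Reserve `count` consecutive IDs and return them as a range."""
        if count < 0:
            raise ValueError("count must be non-negative")
        # Every caller goes through the same lock, so ranges never overlap.
        with self._lock:
            first = self._next_id
            self._next_id += count
        return range(first, first + count)


allocator = RangeIdAllocator()
batch_sizes = [3, 2, 4]  # number of documents in three successive batches
id_ranges = [list(allocator.reserve(size)) for size in batch_sizes]
```

Because every batch must go through one shared counter, this design serializes part of the work, which is consistent with the note above that these options can be slow.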

nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage.read_data(
    paths: list[str],
    read_kwargs: dict[str, typing.Any] | None = None,
    fields: list[str] | None = None
) -> pandas.DataFrame | None

Read JSONL files using Pandas.
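
As a rough illustration of reading a group of JSONL files with Pandas, the sketch below mirrors the `read_data` parameter names but is not the library's implementation. `read_jsonl_files` and `demo` are hypothetical helpers, and returning `None` for an empty path list is an assumption based only on the `pandas.DataFrame | None` return type.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional

import pandas as pd


def read_jsonl_files(
    paths: list[str],
    read_kwargs: Optional[dict[str, Any]] = None,
    fields: Optional[list[str]] = None,
) -> Optional[pd.DataFrame]:
    """Read JSONL files into one DataFrame, optionally keeping only `fields`."""
    if not paths:
        return None  # assumption: nothing to read yields None
    # lines=True tells pandas that each line is a separate JSON object.
    kwargs = {"lines": True, **(read_kwargs or {})}
    frames = []
    for path in paths:
        frame = pd.read_json(path, **kwargs)
        if fields is not None:
            frame = frame[fields]  # keep only the requested columns
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def demo() -> Optional[pd.DataFrame]:
    """Write two small JSONL files and read back only the `text` column."""
    rows_per_file = [
        [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}],
        [{"id": 3, "text": "c"}],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for index, rows in enumerate(rows_per_file):
            path = Path(tmp) / f"part_{index}.jsonl"
            path.write_text("\n".join(json.dumps(row) for row in rows) + "\n", encoding="utf-8")
            paths.append(str(path))
        return read_jsonl_files(paths, fields=["text"])
```

Selecting `fields` after each file is read keeps the sketch simple. A real reader might instead push column selection into the parser to avoid loading unused columns.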