nemo_curator.stages.text.io.reader.jsonl


Module Contents

Classes

Name | Description
JsonlReader | Composite stage for reading JSONL files.
JsonlReaderStage | Stage that processes a group of JSONL files into a DocumentBatch.

API

class nemo_curator.stages.text.io.reader.jsonl.JsonlReader(
file_paths: str | list[str],
files_per_partition: int | None = None,
blocksize: int | str | None = None,
fields: list[str] | None = None,
read_kwargs: dict[str, typing.Any] | None = None,
task_type: typing.Literal['document', 'image', 'video', 'audio'] = 'document',
file_extensions: list[str] = (lambda: FILETYPE_TO_DEFAUL...,
_generate_ids: bool = False,
_assign_ids: bool = False,
name: str = 'jsonl_reader'
)
Dataclass

Bases: CompositeStage[_EmptyTask, DocumentBatch]

Composite stage for reading JSONL files.

This high-level stage decomposes into:

  1. FilePartitioningStage - partitions files into groups
  2. JsonlReaderStage - reads file groups into DocumentBatches
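
The two-step decomposition above can be pictured with a small standard-library sketch. This is an illustration only, not the library's implementation: `partition_files`, `read_group`, and `demo_two_step_read` are hypothetical helpers, and plain lists of dicts stand in for DocumentBatch objects.

```python
import json
import tempfile
from pathlib import Path


def partition_files(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Hypothetical step 1: split a sorted file list into consecutive groups."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    ordered = sorted(paths)
    return [
        ordered[start:start + files_per_partition]
        for start in range(0, len(ordered), files_per_partition)
    ]


def read_group(paths: list[str]) -> list[dict]:
    """Hypothetical step 2: parse every non-empty line in the group as JSON."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    records.append(json.loads(line))
    return records


def demo_two_step_read() -> list[list[dict]]:
    """Write three one-line JSONL files, then partition and read them."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for index in range(3):
            path = Path(tmp) / f"part_{index}.jsonl"
            record = {"id": index, "text": f"doc {index}"}
            path.write_text(json.dumps(record) + "\n", encoding="utf-8")
            paths.append(str(path))
        groups = partition_files(paths, files_per_partition=2)
        # One list of records per file group, mirroring one batch per group.
        return [read_group(group) for group in groups]
```

Calling `demo_two_step_read()` yields one list of records per file group. Keeping partitioning separate from reading means each group can be handed to a different worker.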
_assign_ids
bool = False
_generate_ids
bool = False
blocksize
int | str | None = None
fields
list[str] | None = None
file_extensions
list[str]
file_paths
str | list[str]
files_per_partition
int | None = None
name
str = 'jsonl_reader'
read_kwargs
dict[str, Any] | None = None
task_type
Literal['document', 'image', 'video', 'audio'] = 'document'
nemo_curator.stages.text.io.reader.jsonl.JsonlReader.__post_init__()

Initialize parent class after dataclass initialization.

nemo_curator.stages.text.io.reader.jsonl.JsonlReader.decompose() -> list[nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage]

Decompose into file partitioning and processing stages.

nemo_curator.stages.text.io.reader.jsonl.JsonlReader.get_description() -> str

Get a description of this composite stage.

class nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage(
fields: list[str] | None = None,
read_kwargs: dict[str, typing.Any] = dict(),
name: str = 'jsonl_reader',
_generate_ids: bool = False,
_assign_ids: bool = False
)
Dataclass

Bases: BaseReader

Stage that processes a group of JSONL files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Parameters:

fields
list[str] | None = None

If specified, only read these fields (columns). Defaults to None.

read_kwargs
dict[str, Any] = dict()

Keyword arguments for the reader. Defaults to {}.

_generate_ids
bool = False

Whether to generate monotonically increasing IDs across all files. This uses the IdGenerator actor, which must be instantiated before this stage is used. ID generation can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.

_assign_ids
bool = False

Whether to assign monotonically increasing IDs from an IdGenerator. This uses the IdGenerator actor, which must be instantiated before this stage is used. ID assignment can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.

name
str = 'jsonl_reader'
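
To make the cost of monotonically increasing IDs concrete, here is a minimal sketch of a shared counter that hands out contiguous ID ranges to batches. `SimpleIdCounter`, `add_ids`, and the `_id` field name are hypothetical and are not the IdGenerator actor's interface. The point is only that every batch has to consult one shared piece of state, which is a plausible reason such ID schemes are slower than purely local ones.

```python
import threading


class SimpleIdCounter:
    """Hypothetical stand-in for a shared ID service (not the IdGenerator API)."""

    def __init__(self, start: int = 0) -> None:
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, count: int) -> int:
        """Reserve `count` consecutive IDs and return the first one."""
        with self._lock:
            first = self._next
            self._next += count
            return first


def add_ids(records: list[dict], counter: SimpleIdCounter, id_field: str = "_id") -> list[dict]:
    """Return copies of the records with consecutive IDs from the shared counter."""
    first = counter.reserve(len(records))
    return [{**record, id_field: first + offset} for offset, record in enumerate(records)]
```

Because each call reserves a whole range under a lock, IDs never overlap between batches and they increase in the order the batches reach the counter.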
nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage.read_data(
paths: list[str],
read_kwargs: dict[str, typing.Any] | None = None,
fields: list[str] | None = None
) -> pandas.DataFrame | None

Read JSONL files using Pandas.
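
As a rough illustration of what a pandas-based JSONL read with the same parameter names might look like, the sketch below reads each path with `pandas.read_json(..., lines=True)`, keeps only the requested `fields`, and concatenates the results. The function `read_jsonl_files` and its choice to return `None` for an empty path list are assumptions made for this example; the actual `read_data` method may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional

import pandas as pd


def read_jsonl_files(
    paths: list[str],
    read_kwargs: Optional[dict[str, Any]] = None,
    fields: Optional[list[str]] = None,
) -> Optional[pd.DataFrame]:
    """Hypothetical sketch: read JSONL files into one DataFrame."""
    frames = []
    for path in paths:
        # lines=True tells pandas to parse one JSON object per line.
        frame = pd.read_json(path, lines=True, **(read_kwargs or {}))
        if fields is not None:
            frame = frame[fields]  # keep only the requested columns
        frames.append(frame)
    if not frames:
        return None  # assumption: nothing to read yields None
    return pd.concat(frames, ignore_index=True)


def demo_read_with_fields() -> pd.DataFrame:
    """Write two small JSONL files and read back only `id` and `text`."""
    with tempfile.TemporaryDirectory() as tmp:
        rows = [
            [{"id": 0, "text": "first", "url": "a"}],
            [{"id": 1, "text": "second", "url": "b"},
             {"id": 2, "text": "third", "url": "c"}],
        ]
        paths = []
        for index, file_rows in enumerate(rows):
            path = Path(tmp) / f"part_{index}.jsonl"
            path.write_text(
                "".join(json.dumps(row) + "\n" for row in file_rows),
                encoding="utf-8",
            )
            paths.append(str(path))
        return read_jsonl_files(paths, fields=["id", "text"])
```

In this sketch the column selection happens after parsing, so it reduces what is kept in memory downstream rather than what is read from disk.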