`stages.text.io.reader.jsonl`

Module Contents

Classes

`JsonlReader`	Composite stage for reading JSONL files.
`JsonlReaderStage`	Stage that processes a group of JSONL files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

API

class stages.text.io.reader.jsonl.JsonlReader

Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]

Composite stage for reading JSONL files.

This high-level stage decomposes into:

FilePartitioningStage - partitions files into groups
JsonlReaderStage - reads file groups into DocumentBatches
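The two-step decomposition above can be illustrated with a minimal, standard-library-only sketch. The functions `partition_files` and `read_group` are hypothetical stand-ins for `FilePartitioningStage` and `JsonlReaderStage`; they are not the library's implementation, and details such as sorting the paths before grouping are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def partition_files(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Hypothetical stand-in for FilePartitioningStage: group files into fixed-size chunks."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    ordered = sorted(paths)  # assumption: deterministic ordering before grouping
    return [
        ordered[i : i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]


def read_group(group: list[str]) -> list[dict]:
    """Hypothetical stand-in for JsonlReaderStage: read one file group into one batch."""
    records = []
    for path in group:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records


def demo() -> list[list[dict]]:
    """Write three small JSONL files, partition them in pairs, and read each group."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = Path(tmp) / f"part_{i}.jsonl"
            path.write_text(json.dumps({"id": i, "text": f"doc {i}"}) + "\n", encoding="utf-8")
            paths.append(str(path))
        groups = partition_files(paths, files_per_partition=2)
        return [read_group(group) for group in groups]
```

Here each inner list plays the role of one DocumentBatch: three files with `files_per_partition=2` yield two batches, the second holding the single leftover file.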

Initialization

blocksize: int | str | None = None

decompose() → list[stages.text.io.reader.jsonl.JsonlReaderStage]

Decompose into file partitioning and processing stages.

fields: list[str] | None = None

file_extensions: list[str] = 'field(…)'

file_paths: str | list[str] = None

files_per_partition: int | None = None

get_description() → str

Get a description of this composite stage.

name: str = 'jsonl_reader'

read_kwargs: dict[str, Any] | None = None

task_type: Literal['document', 'image', 'video', 'audio'] = 'document'

class stages.text.io.reader.jsonl.JsonlReaderStage

Bases: stages.text.io.reader.base.BaseReader

Stage that processes a group of JSONL files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.

Args:

fields (list[str], optional): If specified, only read these fields (columns). Defaults to None.

read_kwargs (dict[str, Any], optional): Keyword arguments for the reader. Defaults to {}.

_generate_ids (bool): Whether to generate monotonically increasing IDs across all files. This uses the IdGenerator actor, which must be instantiated before this stage is used. This can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.

_assign_ids (bool): Whether to assign monotonically increasing IDs from an IdGenerator. This uses the IdGenerator actor, which must be instantiated before this stage is used. This can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.

name: str = 'jsonl_reader'

read_data(paths: list[str], read_kwargs: dict[str, Any] | None = None, fields: list[str] | None = None) → pandas.DataFrame | None

Read JSONL files using Pandas.
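As a rough illustration of what reading JSONL files with Pandas can look like, the sketch below mirrors the `read_data` signature using `pandas.read_json(..., lines=True)` and `pandas.concat`. The function name `read_jsonl_sketch` is hypothetical. Returning `None` for an empty path list and applying `fields` as a column selection after reading are assumptions; the actual method may behave differently.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Any

import pandas as pd


def read_jsonl_sketch(
    paths: list[str],
    read_kwargs: dict[str, Any] | None = None,
    fields: list[str] | None = None,
) -> pd.DataFrame | None:
    """Hypothetical sketch: read several JSONL files into a single DataFrame."""
    if not paths:
        return None  # assumption: nothing to read yields None
    kwargs = dict(read_kwargs or {})
    kwargs.setdefault("lines", True)  # one JSON object per line
    frames = [pd.read_json(path, **kwargs) for path in paths]
    df = pd.concat(frames, ignore_index=True)
    if fields is not None:
        df = df[fields]  # keep only the requested columns
    return df


def demo() -> pd.DataFrame | None:
    """Write two small JSONL shards and read back only the 'text' column."""
    shards = [
        [{"id": 1, "text": "a"}],
        [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}],
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i, rows in enumerate(shards):
            path = Path(tmp) / f"shard_{i}.jsonl"
            path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
            paths.append(str(path))
        return read_jsonl_sketch(paths, fields=["text"])
```

Passing `fields` keeps the resulting frame narrow, which matters when only a text column is needed downstream.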
