stages.text.io.reader.jsonl
Module Contents
Classes
JsonlReader | Composite stage for reading JSONL files.
JsonlReaderStage | Stage that processes a group of JSONL files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.
API
- class stages.text.io.reader.jsonl.JsonlReader
Bases: nemo_curator.stages.base.CompositeStage[nemo_curator.tasks._EmptyTask, nemo_curator.tasks.DocumentBatch]
Composite stage for reading JSONL files.
This high-level stage decomposes into:
FilePartitioningStage - partitions files into groups
JsonlReaderStage - reads file groups into DocumentBatches
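The two-step decomposition listed above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the NeMo Curator implementation: `partition_files`, `read_group`, and `read_in_batches` are hypothetical names, and grouping a sorted file list into chunks of `files_per_partition` is an assumption made for illustration.

```python
import json


def partition_files(file_paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Step 1 (assumed behaviour): split a sorted file list into consecutive groups."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be >= 1")
    ordered = sorted(file_paths)
    return [
        ordered[i : i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]


def read_group(paths: list[str]) -> list[dict]:
    """Step 2: read every non-empty line of each file in the group as one JSON record."""
    records: list[dict] = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            records.extend(json.loads(line) for line in handle if line.strip())
    return records


def read_in_batches(file_paths: list[str], files_per_partition: int) -> list[list[dict]]:
    """Chain the two steps: one batch of records per file group."""
    return [read_group(group) for group in partition_files(file_paths, files_per_partition)]
```

In this sketch each file group becomes one batch, which mirrors the idea that one group of files is read into one DocumentBatch.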
Initialization
- blocksize: int | str | None = None
- decompose() -> list[stages.text.io.reader.jsonl.JsonlReaderStage]
Decompose into file partitioning and processing stages.
- fields: list[str] | None = None
- file_extensions: list[str] = ‘field(…)’
- file_paths: str | list[str] = None
- files_per_partition: int | None = None
- get_description() -> str
Get a description of this composite stage.
- read_kwargs: dict[str, Any] | None = None
- task_type: Literal[document, image, video, audio] = ‘document’
- class stages.text.io.reader.jsonl.JsonlReaderStage
Bases: stages.text.io.reader.base.BaseReader
Stage that processes a group of JSONL files into a DocumentBatch. This stage accepts FileGroupTasks created by FilePartitioningStage and reads the actual file contents into DocumentBatches.
Args:
- fields (list[str], optional): If specified, only read these fields (columns). Defaults to None.
- read_kwargs (dict[str, Any], optional): Keyword arguments for the reader. Defaults to {}.
- _generate_ids (bool): Whether to generate monotonically increasing IDs across all files. This uses the IdGenerator actor, which must be instantiated before this stage is used. This can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.
- _assign_ids (bool): Whether to assign monotonically increasing IDs from an IdGenerator. This uses the IdGenerator actor, which must be instantiated before this stage is used. This can be slow, so the AddId stage is recommended instead unless monotonically increasing IDs are required.
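To illustrate what "monotonically increasing IDs across all files" involves, the sketch below hands out contiguous ID ranges from a single shared counter. It is not the library's IdGenerator: `RangeIdGenerator` and `reserve` are hypothetical names. The sketch only shows that every batch has to go through one shared counter to keep IDs ordered across files, which is one plausible reason this path can be slower than assigning IDs independently per batch.

```python
import threading


class RangeIdGenerator:
    """Hypothetical stand-in for a shared ID service (not the library's IdGenerator)."""

    def __init__(self, start: int = 0) -> None:
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, count: int) -> range:
        """Reserve `count` consecutive IDs; ranges never overlap between callers."""
        if count < 0:
            raise ValueError("count must be non-negative")
        with self._lock:  # every caller serialises on this one counter
            first = self._next
            self._next += count
        return range(first, first + count)


# Usage: each batch asks for as many IDs as it has rows.
generator = RangeIdGenerator()
first_batch_ids = list(generator.reserve(3))
second_batch_ids = list(generator.reserve(2))
```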
- read_data(paths: list[str], read_kwargs: dict[str, Any] | None = None, fields: list[str] | None = None)
Read JSONL files using Pandas.
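A reader with this parameter list could look roughly like the sketch below. It is an assumption-laden illustration, not the library's code: the function name `read_jsonl_group` is hypothetical, and three behaviours are guesses: forwarding `read_kwargs` to `pandas.read_json`, selecting `fields` as columns after reading, and returning None for an empty group.

```python
from typing import Any, Optional

import pandas as pd


def read_jsonl_group(
    paths: list[str],
    read_kwargs: Optional[dict[str, Any]] = None,
    fields: Optional[list[str]] = None,
) -> Optional[pd.DataFrame]:
    """Read each JSONL file with pandas and concatenate the results."""
    kwargs = dict(read_kwargs or {})
    kwargs.setdefault("lines", True)  # JSONL: one JSON object per line

    frames = []
    for path in paths:
        frame = pd.read_json(path, **kwargs)
        if fields is not None:
            # Keep only the requested columns; raises KeyError if one is missing.
            frame = frame[fields]
        frames.append(frame)

    if not frames:
        return None  # assumed convention for an empty file group
    return pd.concat(frames, ignore_index=True)
```

Selecting columns after parsing is the simplest choice here; it does not reduce parsing cost, so a real implementation may filter earlier.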