
# nemo_curator.stages.text.io.reader.jsonl

## Module Contents

### Classes

| Name                                                                             | Description                                                       |
| -------------------------------------------------------------------------------- | ----------------------------------------------------------------- |
| [`JsonlReader`](#nemo_curator-stages-text-io-reader-jsonl-JsonlReader)           | Composite stage for reading JSONL files.                          |
| [`JsonlReaderStage`](#nemo_curator-stages-text-io-reader-jsonl-JsonlReaderStage) | Stage that processes a group of JSONL files into a DocumentBatch. |

### API

<Anchor id="nemo_curator-stages-text-io-reader-jsonl-JsonlReader">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.io.reader.jsonl.JsonlReader(
        file_paths: str | list[str],
        files_per_partition: int | None = None,
        blocksize: int | str | None = None,
        fields: list[str] | None = None,
        read_kwargs: dict[str, typing.Any] | None = None,
        task_type: typing.Literal['document', 'image', 'video', 'audio'] = 'document',
        file_extensions: list[str] = (lambda: FILETYPE_TO_DEFAUL...,
        _generate_ids: bool = False,
        _assign_ids: bool = False,
        name: str = 'jsonl_reader'
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Dataclass
  </Badge>

  **Bases:** [CompositeStage\[\_EmptyTask, DocumentBatch\]](/nemo-curator/nemo_curator/stages/base#nemo_curator-stages-base-CompositeStage)

  Composite stage for reading JSONL files.

  This high-level stage decomposes into:

  1. FilePartitioningStage - partitions files into groups
  2. JsonlReaderStage - reads file groups into DocumentBatches

  <ParamField path="_assign_ids" type="bool = False" />

  <ParamField path="_generate_ids" type="bool = False" />

  <ParamField path="blocksize" type="int | str | None = None" />

  <ParamField path="fields" type="list[str] | None = None" />

  <ParamField path="file_extensions" type="list[str]" />

  <ParamField path="file_paths" type="str | list[str]" />

  <ParamField path="files_per_partition" type="int | None = None" />

  <ParamField path="name" type="str = 'jsonl_reader'" />

  <ParamField path="read_kwargs" type="dict[str, Any] | None = None" />

  <ParamField path="task_type" type="Literal['document', 'image', 'video', 'audio'] = 'document'" />

  <Anchor id="nemo_curator-stages-text-io-reader-jsonl-JsonlReader-__post_init__">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.reader.jsonl.JsonlReader.__post_init__()
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Initialize parent class after dataclass initialization.
  </Indent>

  <Anchor id="nemo_curator-stages-text-io-reader-jsonl-JsonlReader-decompose">
    <CodeBlock links={{"nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage":"#nemo_curator-stages-text-io-reader-jsonl-JsonlReaderStage"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.reader.jsonl.JsonlReader.decompose() -> list[nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Decompose into file partitioning and processing stages.
  </Indent>

  <Anchor id="nemo_curator-stages-text-io-reader-jsonl-JsonlReader-get_description">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.reader.jsonl.JsonlReader.get_description() -> str
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Get a description of this composite stage.
  </Indent>
</Indent>
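
The partitioning step of the decomposition described above can be illustrated without the library. The following is a minimal, stdlib-only sketch, assuming that `files_per_partition` simply caps the number of files in each group. `partition_files` is a hypothetical helper written for this example, not part of NeMo Curator, and the real `FilePartitioningStage` may group files differently (for example, when `blocksize` is set).

```python
from __future__ import annotations


def partition_files(file_paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Hypothetical helper: split a list of paths into fixed-size groups."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    # Sort first so the grouping is deterministic regardless of input order.
    ordered = sorted(file_paths)
    return [
        ordered[i : i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]


groups = partition_files(
    ["data/c.jsonl", "data/a.jsonl", "data/b.jsonl"],
    files_per_partition=2,
)
print(groups)  # [['data/a.jsonl', 'data/b.jsonl'], ['data/c.jsonl']]
```

In this sketch, each inner list plays the role of one file group that a reader stage would then turn into a single batch.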

<Anchor id="nemo_curator-stages-text-io-reader-jsonl-JsonlReaderStage">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage(
        fields: list[str] | None = None,
        read_kwargs: dict[str, typing.Any] = dict(),
        name: str = 'jsonl_reader',
        _generate_ids: bool = False,
        _assign_ids: bool = False
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Dataclass
  </Badge>

  **Bases:** [BaseReader](/nemo-curator/nemo_curator/stages/text/io/reader/base#nemo_curator-stages-text-io-reader-base-BaseReader)

  Stage that processes a group of JSONL files into a DocumentBatch.
  This stage accepts FileGroupTasks created by FilePartitioningStage
  and reads the actual file contents into DocumentBatches.

  **Parameters:**

  <ParamField path="fields" type="list[str] | None" default="None">
    If specified, read only these fields (columns). Defaults to None.
  </ParamField>

  <ParamField path="read_kwargs" type="dict[str, Any]" default="dict()">
    Keyword arguments for the reader. Defaults to an empty dict.
  </ParamField>

  <ParamField path="_generate_ids" type="bool" default="False">
    Whether to generate monotonically increasing IDs across all files.
    This uses the IdGenerator actor, which must be instantiated before this stage is used.
    ID generation can be slow, so the AddId stage is recommended instead unless monotonically
    increasing IDs are required.
  </ParamField>

  <ParamField path="_assign_ids" type="bool" default="False">
    Whether to assign monotonically increasing IDs from an IdGenerator.
    This uses the IdGenerator actor, which must be instantiated before this stage is used.
    ID assignment can be slow, so the AddId stage is recommended instead unless monotonically
    increasing IDs are required.
  </ParamField>

  <ParamField path="name" type="str = 'jsonl_reader'" />

  <Anchor id="nemo_curator-stages-text-io-reader-jsonl-JsonlReaderStage-read_data">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.reader.jsonl.JsonlReaderStage.read_data(
          paths: list[str],
          read_kwargs: dict[str, typing.Any] | None = None,
          fields: list[str] | None = None
      ) -> pandas.DataFrame | None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Read JSONL files using pandas.
  </Indent>
</Indent>
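
To make the role of `paths` and `fields` concrete, here is a small sketch of reading several JSONL files and keeping only selected fields. It deliberately uses only the Python standard library instead of pandas, so it is not the implementation of `read_data`. `read_jsonl_records` is a hypothetical helper, and returning `None` when no records are found is an assumption modeled on the `pandas.DataFrame | None` return type shown above.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def read_jsonl_records(paths: list[str], fields: list[str] | None = None):
    """Hypothetical helper: read JSONL files into a list of dicts."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                record = json.loads(line)  # one JSON object per line
                if fields is not None:
                    # Keep only the requested fields; missing keys become None.
                    record = {key: record.get(key) for key in fields}
                records.append(record)
    # Assumption: signal "nothing read" with None rather than an empty list.
    return records or None


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "part-0.jsonl"
    second = Path(tmp) / "part-1.jsonl"
    first.write_text('{"id": 1, "text": "hello", "lang": "en"}\n', encoding="utf-8")
    second.write_text('{"id": 2, "text": "bonjour", "lang": "fr"}\n', encoding="utf-8")

    rows = read_jsonl_records([str(first), str(second)], fields=["id", "text"])
    empty = read_jsonl_records([], fields=["id"])

print(rows)   # [{'id': 1, 'text': 'hello'}, {'id': 2, 'text': 'bonjour'}]
print(empty)  # None
```

Selecting fields while reading, rather than afterwards, keeps unused columns out of memory, which is the usual motivation for a `fields` parameter on a reader.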