# nemo_curator.stages.text.io.writer.megatron_tokenizer

## Module Contents

### Classes

| Name                                                                                                        | Description                                                           |
| ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`MegatronTokenizerWriter`](#nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter) | Writer that writes a DocumentBatch to Megatron-ready tokenized files. |

### Data

[`_INDEX_HEADER`](#nemo_curator-stages-text-io-writer-megatron_tokenizer-_INDEX_HEADER)

### API

<Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter(
        path: str,
        file_extension: list[str] = (lambda: FILETYPE_TO_DEFAUL...,
        write_kwargs: dict[str, typing.Any] = dict(),
        fields: list[str] | None = None,
        name: str = 'megatron_tokenizer_writer',
        mode: typing.Literal['ignore', 'overwrite', 'append', 'error'] = 'ignore',
        append_mode_implemented: bool = False,
        model_identifier: str | None = None,
        cache_dir: str | None = None,
        hf_token: str | None = None,
        text_field: str = 'text',
        tokenization_batch_size: int = 1000,
        append_eod: bool = False,
        transformers_init_kwargs: dict[str, typing.Any] = dict()
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  <Badge>
    Dataclass
  </Badge>

  **Bases:** [BaseWriter](/nemo-curator/nemo_curator/stages/text/io/writer/base#nemo_curator-stages-text-io-writer-base-BaseWriter)

  Writer that writes a DocumentBatch to Megatron-ready tokenized files.

  <ParamField path="append_eod" type="bool = False" />

  <ParamField path="cache_dir" type="str | None = None" />

  <ParamField path="fields" type="list[str] | None = field(default=None, init=False, repr=False)" />

  <ParamField path="file_extension" type="list[str]" />

  <ParamField path="hf_token" type="str | None = None" />

  <ParamField path="model_identifier" type="str | None = None" />

  <ParamField path="name" type="str = 'megatron_tokenizer_writer'" />

  <ParamField path="text_field" type="str = 'text'" />

  <ParamField path="tokenization_batch_size" type="int = 1000" />

  <ParamField path="transformers_init_kwargs" type="dict[str, Any] = field(default_factory=dict)" />
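
The following is a minimal construction sketch, assuming `nemo_curator` is installed and that the constructor accepts the keyword arguments listed in the signature above. The output directory and the `gpt2` model identifier are placeholder values (the latter assumes `model_identifier` names a Hugging Face tokenizer), and `build_writer` is a hypothetical helper, not part of the library.

```python
from typing import Any

# Placeholder settings; every key is a constructor parameter from the signature above.
WRITER_KWARGS: dict[str, Any] = {
    "path": "./megatron_tokens",  # hypothetical output directory
    "model_identifier": "gpt2",   # hypothetical tokenizer identifier (assumption)
    "text_field": "text",
    "tokenization_batch_size": 1000,
    "append_eod": True,
}


def build_writer(**overrides: Any):
    """Hypothetical helper: construct the writer with the settings above."""
    # Imported lazily so this module still loads where nemo_curator is absent.
    from nemo_curator.stages.text.io.writer.megatron_tokenizer import (
        MegatronTokenizerWriter,
    )

    return MegatronTokenizerWriter(**{**WRITER_KWARGS, **overrides})
```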

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-__post_init__">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter.__post_init__()
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-_sequence_pointers">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter._sequence_pointers(
          sequence_lengths: list[int],
          token_size: int
      ) -> list[int]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    <Badge>
      staticmethod
    </Badge>

    Build the sequence pointers from the sequence lengths and the token size in bytes.

    **Returns:** `list[int]`: The pointer to the beginning of each sequence

    **Parameters:**

    <ParamField path="sequence_lengths" type="list[int]">
      The length of each sequence
    </ParamField>

    <ParamField path="token_size" type="int">
      The size of each token in bytes
    </ParamField>
  </Indent>
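
The pointer computation described for `_sequence_pointers` can be sketched as a running byte offset. This is an illustrative standalone function under the assumption that each pointer is the cumulative size of all preceding sequences; it is not the library's source.

```python
def sequence_pointers(sequence_lengths: list[int], token_size: int) -> list[int]:
    """Return the byte offset at which each sequence starts."""
    pointers = []
    offset = 0
    for length in sequence_lengths:
        pointers.append(offset)
        # Each sequence occupies length * token_size bytes.
        offset += length * token_size
    return pointers


print(sequence_pointers([3, 5, 2], token_size=2))  # [0, 6, 16]
```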

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-process">
    <CodeBlock links={{"nemo_curator.tasks.DocumentBatch":"/nemo-curator/nemo_curator/tasks/document#nemo_curator-tasks-document-DocumentBatch","nemo_curator.tasks.FileGroupTask":"/nemo-curator/nemo_curator/tasks/file_group#nemo_curator-tasks-file_group-FileGroupTask"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter.process(
          task: nemo_curator.tasks.DocumentBatch
      ) -> nemo_curator.tasks.FileGroupTask
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-setup">
    <CodeBlock links={{"nemo_curator.backends.base.WorkerMetadata":"/nemo-curator/nemo_curator/backends/base#nemo_curator-backends-base-WorkerMetadata"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter.setup(
          _worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-setup_on_node">
    <CodeBlock links={{"nemo_curator.backends.base.NodeInfo":"/nemo-curator/nemo_curator/backends/base#nemo_curator-backends-base-NodeInfo","nemo_curator.backends.base.WorkerMetadata":"/nemo-curator/nemo_curator/backends/base#nemo_curator-backends-base-WorkerMetadata"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter.setup_on_node(
          _node_info: nemo_curator.backends.base.NodeInfo | None = None,
          _worker_metadata: nemo_curator.backends.base.WorkerMetadata = None
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent />

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-write_data">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter.write_data(
          bin_file: typing.BinaryIO,
          token_dtype: numpy.dtype,
          eod_token_id: int,
          tokens_batch: list[list[int]],
          sequence_lengths: list[int]
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Write tokens to the .bin file

    **Parameters:**

    <ParamField path="tokens_batch" type="list[list[int]]">
      The batch of tokens to write
    </ParamField>
  </Indent>
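
To illustrate what writing a token batch to a `.bin` stream might look like, here is a dependency-free sketch. It swaps NumPy for the standard-library `struct` module and fixes the token type to little-endian unsigned 16-bit, whereas the real method takes a `numpy.dtype`. The function name `write_tokens`, the optional `eod_token_id` handling, and the choice to count the end-of-document token in the recorded length are all assumptions.

```python
import io
import struct
from typing import BinaryIO, Optional


def write_tokens(
    bin_file: BinaryIO,
    tokens_batch: list[list[int]],
    sequence_lengths: list[int],
    eod_token_id: Optional[int] = None,
) -> None:
    """Append each sequence to bin_file as little-endian uint16 token ids."""
    for tokens in tokens_batch:
        if eod_token_id is not None:
            # Assumption: the end-of-document id is appended to every sequence.
            tokens = [*tokens, eod_token_id]
        bin_file.write(struct.pack(f"<{len(tokens)}H", *tokens))
        sequence_lengths.append(len(tokens))


buffer = io.BytesIO()
lengths: list[int] = []
write_tokens(buffer, [[1, 2, 3], [4, 5]], lengths, eod_token_id=0)
print(lengths)                 # [4, 3]
print(len(buffer.getvalue()))  # 14
```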

  <Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-MegatronTokenizerWriter-write_idx_data">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.stages.text.io.writer.megatron_tokenizer.MegatronTokenizerWriter.write_idx_data(
          file_prefix: str,
          token_size: int,
          token_dtype_code: int,
          sequence_lengths: list[int]
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Write the .idx file data
  </Indent>
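
The sketch below shows one plausible `.idx` layout, assuming the file follows the index format used by Megatron-LM's indexed datasets: header, version, dtype code, sequence and document counts, then the length, pointer, and document-boundary arrays. This page does not confirm that layout, so treat every field as an assumption. `write_index` is a hypothetical function that writes to a stream instead of taking a `file_prefix`, the dtype code `8` is assumed to denote unsigned 16-bit tokens, and each sequence is assumed to be its own document.

```python
import io
import struct
from typing import BinaryIO

INDEX_HEADER = b"MMIDIDX\x00\x00"  # value of _INDEX_HEADER documented on this page


def write_index(
    idx_file: BinaryIO,
    token_size: int,
    token_dtype_code: int,
    sequence_lengths: list[int],
) -> None:
    """Write an index in the assumed Megatron-LM layout (unverified here)."""
    count = len(sequence_lengths)
    # Byte offset of each sequence in the .bin file.
    pointers, offset = [], 0
    for length in sequence_lengths:
        pointers.append(offset)
        offset += length * token_size
    # Assumption: one sequence per document, so boundaries are 0..count.
    document_indices = list(range(count + 1))

    idx_file.write(INDEX_HEADER)
    idx_file.write(struct.pack("<Q", 1))                      # format version
    idx_file.write(struct.pack("<B", token_dtype_code))       # token dtype code
    idx_file.write(struct.pack("<Q", count))                  # sequence count
    idx_file.write(struct.pack("<Q", len(document_indices)))  # document count
    idx_file.write(struct.pack(f"<{count}i", *sequence_lengths))      # int32 lengths
    idx_file.write(struct.pack(f"<{count}q", *pointers))              # int64 pointers
    idx_file.write(struct.pack(f"<{count + 1}q", *document_indices))  # int64 boundaries


index = io.BytesIO()
write_index(index, token_size=2, token_dtype_code=8, sequence_lengths=[4, 3])
print(len(index.getvalue()))  # 82
```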
</Indent>

<Anchor id="nemo_curator-stages-text-io-writer-megatron_tokenizer-_INDEX_HEADER">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.stages.text.io.writer.megatron_tokenizer._INDEX_HEADER = b'MMIDIDX\x00\x00'
    ```
  </CodeBlock>
</Anchor>
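
Assuming `_INDEX_HEADER` is the magic prefix written at the start of each `.idx` file, a reader could use it as a quick sanity check. The helper `has_index_header` below is hypothetical and only compares the first nine bytes of a stream against the documented value.

```python
import io
from typing import BinaryIO

INDEX_HEADER = b"MMIDIDX\x00\x00"  # same bytes as _INDEX_HEADER above


def has_index_header(stream: BinaryIO) -> bool:
    """Return True if the stream begins with the 9-byte index header."""
    return stream.read(len(INDEX_HEADER)) == INDEX_HEADER


print(has_index_header(io.BytesIO(INDEX_HEADER + b"\x01")))  # True
print(has_index_header(io.BytesIO(b"not an index")))         # False
```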