# nemo_curator.utils.merge_file_prefixes

Simplified version of the tools/merge\_datasets.py script from the Megatron-LM library
([https://github.com/NVIDIA/Megatron-LM/blob/main/tools/merge\_datasets.py](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/merge_datasets.py)).

## Module Contents

### Classes

| Name                                                                                     | Description                                                                         |
| ---------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| [`IndexedDatasetBuilder`](#nemo_curator-utils-merge_file_prefixes-IndexedDatasetBuilder) | Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library. |
| [`_IndexWriter`](#nemo_curator-utils-merge_file_prefixes-_IndexWriter)                   | Simplified version of the \_IndexWriter class from the Megatron-LM library.         |

### Functions

| Name                                                                                       | Description                                    |
| ------------------------------------------------------------------------------------------ | ---------------------------------------------- |
| [`extract_index_contents`](#nemo_curator-utils-merge_file_prefixes-extract_index_contents) | Extract the index contents from the index file |
| [`get_args`](#nemo_curator-utils-merge_file_prefixes-get_args)                             | Parse the command-line arguments into an `argparse.Namespace` |
| [`merge_file_prefixes`](#nemo_curator-utils-merge_file_prefixes-merge_file_prefixes)       | Merge the indexed dataset files found in `input_dir` into one dataset written at `output_prefix` |

### Data

[`_INDEX_HEADER`](#nemo_curator-utils-merge_file_prefixes-_INDEX_HEADER)

[`args`](#nemo_curator-utils-merge_file_prefixes-args)

### API

<Anchor id="nemo_curator-utils-merge_file_prefixes-IndexedDatasetBuilder">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder(
        bin_path: str,
        dtype: type[numpy.number]
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library.

  Builder class for the IndexedDataset class

  **Parameters:**

  <ParamField path="bin_path" type="str">
    The path to the data (.bin) file
  </ParamField>

  <ParamField path="dtype" type="Type[np.number]">
    The dtype of the index file. The docstring mentions a default of np.int32, but the signature above declares no default, so pass it explicitly.
  </ParamField>

  <ParamField path="data_file" type="= open(bin_path, 'wb')" />

  <ParamField path="document_indices" type="= [0]" />

  <ParamField path="sequence_lengths" type="= []" />

  <Anchor id="nemo_curator-utils-merge_file_prefixes-IndexedDatasetBuilder-add_index">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder.add_index(
          path_prefix: str
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Add an entire IndexedDataset to the dataset

    **Parameters:**

    <ParamField path="path_prefix" type="str">
      The index (.idx) and data (.bin) prefix
    </ParamField>
  </Indent>

  <Anchor id="nemo_curator-utils-merge_file_prefixes-IndexedDatasetBuilder-finalize">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder.finalize(
          idx_path: str
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Clean up and write the index (.idx) file

    **Parameters:**

    <ParamField path="idx_path" type="str">
      The path to the index file
    </ParamField>
  </Indent>
</Indent>
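
The bookkeeping behind `add_index` can be illustrated without touching any files. The sketch below is not the library code: `merge_index_metadata` is a hypothetical helper, and it assumes that every dataset's document indices start with 0 (matching the builder's initial `document_indices = [0]`) and that incoming document boundaries are shifted by the number of sequences already merged, as in the Megatron-LM builder this class is derived from.

```python
def merge_index_metadata(datasets):
    """Combine per-dataset (sequence_lengths, document_indices) pairs.

    Hypothetical helper for illustration only. Each document_indices
    list is assumed to start with 0.
    """
    sequence_lengths = []
    document_indices = [0]
    for lengths, doc_indices in datasets:
        # Sequences merged so far shift the incoming document boundaries.
        offset = len(sequence_lengths)
        sequence_lengths.extend(int(n) for n in lengths)
        # Skip the leading 0 so the shared boundary is not duplicated.
        document_indices.extend(offset + int(i) for i in doc_indices[1:])
    return sequence_lengths, document_indices


first = ([3, 2], [0, 1, 2])  # two documents, one sequence each
second = ([4], [0, 1])       # one document holding one sequence
merged_lengths, merged_docs = merge_index_metadata([first, second])
print(merged_lengths, merged_docs)
```

In the real builder the token data from each `.bin` file would also be appended to the output data file; that step is omitted here.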

<Anchor id="nemo_curator-utils-merge_file_prefixes-_IndexWriter">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    class nemo_curator.utils.merge_file_prefixes._IndexWriter(
        idx_path: str,
        dtype: type[numpy.number]
    )
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Simplified version of the \_IndexWriter class from the Megatron-LM library.

  Object class to write the index (.idx) file

  **Parameters:**

  <ParamField path="idx_path" type="str">
    The path to the index file
  </ParamField>

  <ParamField path="dtype" type="Type[np.number]">
    The dtype of the index file
  </ParamField>

  <Anchor id="nemo_curator-utils-merge_file_prefixes-_IndexWriter-__enter__">
    <CodeBlock links={{"nemo_curator.utils.merge_file_prefixes._IndexWriter":"#nemo_curator-utils-merge_file_prefixes-_IndexWriter"}} showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.utils.merge_file_prefixes._IndexWriter.__enter__() -> nemo_curator.utils.merge_file_prefixes._IndexWriter
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Enter the context introduced by the 'with' keyword

    **Returns:** `_IndexWriter`

    The instance
  </Indent>

  <Anchor id="nemo_curator-utils-merge_file_prefixes-_IndexWriter-__exit__">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.utils.merge_file_prefixes._IndexWriter.__exit__(
          exc_type: type[BaseException] | None,
          exc_val: BaseException | None,
          exc_tb: types.TracebackType | None
      ) -> bool | None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Exit the context introduced by the 'with' keyword

    **Parameters:**

    <ParamField path="exc_type" type="Optional[Type[BaseException]]">
      Exception type
    </ParamField>

    <ParamField path="exc_val" type="Optional[BaseException]">
      Exception value
    </ParamField>

    <ParamField path="exc_tb" type="Optional[TracebackType]">
      Exception traceback object
    </ParamField>

    **Returns:** `bool | None`

    Whether to silence the exception
  </Indent>

  <Anchor id="nemo_curator-utils-merge_file_prefixes-_IndexWriter-_sequence_pointers">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.utils.merge_file_prefixes._IndexWriter._sequence_pointers(
          sequence_lengths: collections.abc.Iterable[int | numpy.integer]
      ) -> list[int]
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Build the sequence pointers per the sequence lengths and dtype size

    **Parameters:**

    <ParamField path="sequence_lengths" type="List[int]">
      The length of each sequence
    </ParamField>

    **Returns:** `list[int]`

    The pointer to the beginning of each sequence
  </Indent>

  <Anchor id="nemo_curator-utils-merge_file_prefixes-_IndexWriter-write">
    <CodeBlock showLineNumbers={false} wordWrap={true}>
      ```python
      nemo_curator.utils.merge_file_prefixes._IndexWriter.write(
          sequence_lengths: collections.abc.Iterable[int | numpy.integer],
          document_indices: collections.abc.Iterable[int | numpy.integer]
      ) -> None
      ```
    </CodeBlock>
  </Anchor>

  <Indent>
    Write the index (.idx) file

    **Parameters:**

    <ParamField path="sequence_lengths" type="List[int]">
      The length of each sequence
    </ParamField>

    <ParamField path="document_indices" type="List[int]">
      The sequence indices demarcating the end of each document
    </ParamField>
  </Indent>
</Indent>
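
As a rough illustration of what `_sequence_pointers` describes, the sketch below turns sequence lengths into byte offsets: each pointer is the running total of the preceding lengths multiplied by the size of one element. The function name `sequence_pointers` and the explicit `itemsize` argument are assumptions made to keep the example free of NumPy; the library method takes the element size from the writer's dtype.

```python
def sequence_pointers(sequence_lengths, itemsize):
    """Return the byte offset at which each sequence starts.

    itemsize is the size in bytes of one stored element, for example
    4 for a 32-bit integer dtype (an explicit argument in this sketch).
    """
    pointers = []
    current = 0
    for length in sequence_lengths:
        pointers.append(current)
        # The next sequence starts after this one's bytes.
        current += int(length) * itemsize
    return pointers


pointers = sequence_pointers([3, 2, 4], itemsize=4)
print(pointers)  # [0, 12, 20]
```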

<Anchor id="nemo_curator-utils-merge_file_prefixes-extract_index_contents">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.merge_file_prefixes.extract_index_contents(
        idx_path: str
    ) -> tuple[numpy.ndarray, numpy.ndarray, type[numpy.number]]
    ```
  </CodeBlock>
</Anchor>

<Indent>
  Extract the index contents from the index file

  **Parameters:**

  <ParamField path="idx_path" type="str">
    The path to the index file
  </ParamField>

  **Returns:** `tuple[np.ndarray, np.ndarray, type[np.number]]`

  The sequence lengths, document indices and dtype of the index file
</Indent>
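
To make the shape of an index file more concrete, here is a small round trip using only the standard library. The byte layout is an assumption borrowed from the Megatron-LM indexed dataset format, not something this page specifies: the `_INDEX_HEADER` bytes, a little-endian uint64 version, a uint8 dtype code, uint64 sequence and document counts, then int32 sequence lengths, int64 sequence pointers and int64 document indices. `build_toy_index` and `read_toy_index` are hypothetical names, and the reader returns the raw dtype code rather than a NumPy dtype.

```python
import struct

INDEX_HEADER = b"MMIDIDX\x00\x00"  # value of _INDEX_HEADER shown on this page


def build_toy_index(sequence_lengths, document_indices, dtype_code=4, itemsize=4):
    """Serialize a toy index using the ASSUMED layout described above."""
    pointers, current = [], 0
    for length in sequence_lengths:
        pointers.append(current)
        current += length * itemsize
    n_seq, n_doc = len(sequence_lengths), len(document_indices)
    return b"".join([
        INDEX_HEADER,
        struct.pack("<Q", 1),           # format version (assumed to be 1)
        struct.pack("<B", dtype_code),  # dtype code (assumed: 4 means int32)
        struct.pack("<Q", n_seq),
        struct.pack("<Q", n_doc),
        struct.pack(f"<{n_seq}i", *sequence_lengths),
        struct.pack(f"<{n_seq}q", *pointers),
        struct.pack(f"<{n_doc}q", *document_indices),
    ])


def read_toy_index(blob):
    """Return (sequence_lengths, document_indices, dtype_code)."""
    if blob[: len(INDEX_HEADER)] != INDEX_HEADER:
        raise ValueError("not an indexed dataset index file")
    offset = len(INDEX_HEADER)
    _version, dtype_code, n_seq, n_doc = struct.unpack_from("<QBQQ", blob, offset)
    offset += struct.calcsize("<QBQQ")
    lengths = list(struct.unpack_from(f"<{n_seq}i", blob, offset))
    offset += 4 * n_seq  # past the int32 sequence lengths
    offset += 8 * n_seq  # skip the int64 sequence pointers
    doc_indices = list(struct.unpack_from(f"<{n_doc}q", blob, offset))
    return lengths, doc_indices, dtype_code


blob = build_toy_index([3, 2, 4], [0, 1, 3])
lengths, doc_indices, dtype_code = read_toy_index(blob)
print(lengths, doc_indices, dtype_code)
```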

<Anchor id="nemo_curator-utils-merge_file_prefixes-get_args">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.merge_file_prefixes.get_args() -> argparse.Namespace
    ```
  </CodeBlock>
</Anchor>

<Indent />
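
This page does not list the options that `get_args` accepts. As a sketch only, a parser that mirrors the two parameters of `merge_file_prefixes` might look like the following; the flag names `--input-dir` and `--output-prefix` and the helper `build_parser` are assumptions for illustration, not confirmed options of the script.

```python
import argparse


def build_parser():
    """Build a parser with HYPOTHETICAL flags mirroring merge_file_prefixes."""
    parser = argparse.ArgumentParser(
        description="Merge indexed dataset file prefixes (illustrative sketch)."
    )
    # Assumed flag: directory holding the .bin/.idx pairs to merge.
    parser.add_argument("--input-dir", required=True)
    # Assumed flag: path prefix for the merged .bin/.idx output.
    parser.add_argument("--output-prefix", required=True)
    return parser


# Passing an explicit argument list avoids reading sys.argv.
example = build_parser().parse_args(
    ["--input-dir", "shards", "--output-prefix", "merged/all"]
)
print(example.input_dir, example.output_prefix)
```

Because the module-level `args` value documented on this page is assigned from `get_args()`, the real parser runs when the module is loaded, so the module is meant to be executed as a script rather than imported.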

<Anchor id="nemo_curator-utils-merge_file_prefixes-merge_file_prefixes">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.merge_file_prefixes.merge_file_prefixes(
        input_dir: str,
        output_prefix: str
    ) -> None
    ```
  </CodeBlock>
</Anchor>

<Indent />
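
The signature suggests that the function gathers dataset prefixes from `input_dir` and writes one merged dataset at `output_prefix`. One plausible first step, sketched below under that assumption, is to collect every prefix that has both a data (.bin) and an index (.idx) file. `find_complete_prefixes` is a hypothetical helper, and the library may pair or validate files differently.

```python
import os
import tempfile


def find_complete_prefixes(input_dir):
    """Return sorted path prefixes that have both a .bin and a .idx file."""
    names = set(os.listdir(input_dir))
    prefixes = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        # Keep a prefix only when its index file is present as well.
        if ext == ".bin" and stem + ".idx" in names:
            prefixes.append(os.path.join(input_dir, stem))
    return prefixes


with tempfile.TemporaryDirectory() as tmp:
    # a has both files, b lacks its index, notes.txt is unrelated.
    for name in ("a.bin", "a.idx", "b.bin", "notes.txt"):
        open(os.path.join(tmp, name), "wb").close()
    found = [os.path.basename(p) for p in find_complete_prefixes(tmp)]

print(found)  # ['a']
```

Each prefix found this way could then be passed to `IndexedDatasetBuilder.add_index`, followed by a single `finalize` call to write the merged index.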

<Anchor id="nemo_curator-utils-merge_file_prefixes-_INDEX_HEADER">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.merge_file_prefixes._INDEX_HEADER = b'MMIDIDX\x00\x00'
    ```
  </CodeBlock>
</Anchor>

<Anchor id="nemo_curator-utils-merge_file_prefixes-args">
  <CodeBlock showLineNumbers={false} wordWrap={true}>
    ```python
    nemo_curator.utils.merge_file_prefixes.args = get_args()
    ```
  </CodeBlock>
</Anchor>