# nemo_curator.utils.merge_file_prefixes

Simplified version of the [`tools/merge_datasets.py`](https://github.com/NVIDIA/Megatron-LM/blob/main/tools/merge_datasets.py) script from the Megatron-LM library.

## Module Contents

### Classes

| Name                                                                                     | Description                                                                         |
| ---------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
| [`IndexedDatasetBuilder`](#nemo_curator-utils-merge_file_prefixes-IndexedDatasetBuilder) | Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library. |
| [`_IndexWriter`](#nemo_curator-utils-merge_file_prefixes-_IndexWriter)                   | Simplified version of the \_IndexWriter class from the Megatron-LM library.         |

### Functions

| Name                                                                                       | Description                                                                               |
| ------------------------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------- |
| [`extract_index_contents`](#nemo_curator-utils-merge_file_prefixes-extract_index_contents) | Extract the index contents from the index file.                                           |
| [`get_args`](#nemo_curator-utils-merge_file_prefixes-get_args)                             | Parse the command-line arguments into an `argparse.Namespace`.                            |
| [`merge_file_prefixes`](#nemo_curator-utils-merge_file_prefixes-merge_file_prefixes)       | Merge the indexed datasets in `input_dir` into a single dataset at `output_prefix`.       |

### Data

[`_INDEX_HEADER`](#nemo_curator-utils-merge_file_prefixes-_INDEX_HEADER)

[`args`](#nemo_curator-utils-merge_file_prefixes-args)

### API

```python
class nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder(
    bin_path: str,
    dtype: type[numpy.number]
)
```

Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library.

Builder class for the IndexedDataset class.

**Parameters:**

`bin_path` (`str`): The path to the data (.bin) file.

`dtype` (`type[numpy.number]`): The dtype of the index file. Defaults to `np.int32`.

```python
nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder.add_index(
    path_prefix: str
) -> None
```

Add an entire IndexedDataset to the dataset.

**Parameters:**

`path_prefix` (`str`): The shared path prefix of the index (.idx) and data (.bin) files.
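The index bookkeeping behind such a merge can be sketched with plain lists. This is a minimal sketch, assuming the upstream Megatron-LM convention in which `document_indices` starts with `0` and every later entry is the running sequence count at the end of a document. `merge_index_metadata` is a hypothetical helper written for illustration, not part of this module.

```python
def merge_index_metadata(
    sequence_lengths: list[int],
    document_indices: list[int],
    other_sequence_lengths: list[int],
    other_document_indices: list[int],
) -> tuple[list[int], list[int]]:
    """Append one dataset's index metadata to another (hypothetical helper)."""
    # Sequences already present shift every incoming document boundary.
    offset = len(sequence_lengths)
    merged_lengths = sequence_lengths + list(other_sequence_lengths)
    # Skip the incoming leading 0 (assumed convention): it would duplicate
    # the boundary that already ends the existing dataset.
    merged_documents = document_indices + [
        offset + index for index in other_document_indices[1:]
    ]
    return merged_lengths, merged_documents


lengths, documents = merge_index_metadata([3, 2], [0, 1, 2], [4, 1, 5], [0, 2, 3])
print(lengths)    # [3, 2, 4, 1, 5]
print(documents)  # [0, 1, 2, 4, 5]
```

The token data itself would be appended separately, for example by copying the incoming .bin file onto the end of the output .bin file.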

```python
nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder.finalize(
    idx_path: str
) -> None
```

Clean up and write the index (.idx) file.

**Parameters:**

`idx_path` (`str`): The path to the index file.

```python
class nemo_curator.utils.merge_file_prefixes._IndexWriter(
    idx_path: str,
    dtype: type[numpy.number]
)
```

Simplified version of the \_IndexWriter class from the Megatron-LM library.

Object class to write the index (.idx) file.

**Parameters:**

`idx_path` (`str`): The path to the index file.

`dtype` (`type[numpy.number]`): The dtype of the index file.

```python
nemo_curator.utils.merge_file_prefixes._IndexWriter.__enter__() -> nemo_curator.utils.merge_file_prefixes._IndexWriter
```

Enter the context introduced by the `with` keyword.

**Returns:** `_IndexWriter`

The instance.

```python
nemo_curator.utils.merge_file_prefixes._IndexWriter.__exit__(
    exc_type: type[BaseException] | None,
    exc_val: BaseException | None,
    exc_tb: types.TracebackType | None
) -> bool | None
```

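Exit the context introduced by the `with` keyword.

**Parameters:**

`exc_type` (`type[BaseException] | None`): The exception type.

`exc_val` (`BaseException | None`): The exception value.

`exc_tb` (`types.TracebackType | None`): The exception traceback object.

**Returns:** `bool | None`

Whether to silence the exception.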
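The `__enter__`/`__exit__` pair follows the standard Python context-manager protocol. The sketch below illustrates that protocol with a hypothetical `HeaderWriter` class; the idea that the file is opened and the header written on entry is an assumption made for illustration, not a description of `_IndexWriter` internals.

```python
import os
import tempfile

INDEX_HEADER = b"MMIDIDX\x00\x00"  # same bytes as _INDEX_HEADER in this module


class HeaderWriter:
    """Hypothetical stand-in for a writer that manages its own file handle."""

    def __init__(self, idx_path: str) -> None:
        self.idx_path = idx_path
        self.idx_writer = None

    def __enter__(self) -> "HeaderWriter":
        # Open the file on entry and return the instance for the 'as' target.
        self.idx_writer = open(self.idx_path, "wb")
        self.idx_writer.write(INDEX_HEADER)  # assumed behavior, for illustration
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        # Always close the file; returning None lets any exception propagate.
        self.idx_writer.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.idx")
    with HeaderWriter(path) as writer:
        pass  # index contents would be written here
    with open(path, "rb") as f:
        written = f.read()
    closed_after_exit = writer.idx_writer.closed
```

Returning `None` (or `False`) from `__exit__` re-raises any exception from the `with` body; returning `True` would silence it.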

```python
nemo_curator.utils.merge_file_prefixes._IndexWriter._sequence_pointers(
    sequence_lengths: collections.abc.Iterable[int | numpy.integer]
) -> list[int]
```

Build the sequence pointers from the sequence lengths and the dtype size.

**Parameters:**

`sequence_lengths` (`collections.abc.Iterable[int | numpy.integer]`): The length of each sequence.

**Returns:** `list[int]`

The pointer to the beginning of each sequence.
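A pointer computation of this kind can be sketched as a running byte offset. This is a minimal sketch, assuming each pointer is a byte offset into the data file and each token occupies `itemsize` bytes; `sequence_pointers` and its `itemsize` parameter are hypothetical names, not the module's API.

```python
def sequence_pointers(sequence_lengths, itemsize: int) -> list[int]:
    """Return the starting byte offset of each sequence (hypothetical helper)."""
    pointers = []
    current = 0
    for length in sequence_lengths:
        # A sequence starts where all earlier sequences end.
        pointers.append(current)
        current += int(length) * itemsize
    return pointers


# Three sequences of 4-byte tokens (for example, a 32-bit integer dtype).
pointers = sequence_pointers([3, 2, 4], itemsize=4)
print(pointers)  # [0, 12, 20]
```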

```python
nemo_curator.utils.merge_file_prefixes._IndexWriter.write(
    sequence_lengths: collections.abc.Iterable[int | numpy.integer],
    document_indices: collections.abc.Iterable[int | numpy.integer]
) -> None
```

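Write the index (.idx) file.

**Parameters:**

`sequence_lengths` (`collections.abc.Iterable[int | numpy.integer]`): The length of each sequence.

`document_indices` (`collections.abc.Iterable[int | numpy.integer]`): The sequence indices demarcating the end of each document.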

```python
nemo_curator.utils.merge_file_prefixes.extract_index_contents(
    idx_path: str
) -> tuple[numpy.ndarray, numpy.ndarray, type[numpy.number]]
```

Extract the index contents from the index file.

**Parameters:**

`idx_path` (`str`): The path to the index file.

**Returns:** `tuple[numpy.ndarray, numpy.ndarray, type[numpy.number]]`

The sequence lengths, document indices, and dtype of the index file.

```python
nemo_curator.utils.merge_file_prefixes.get_args() -> argparse.Namespace
```

```python
nemo_curator.utils.merge_file_prefixes.merge_file_prefixes(
    input_dir: str,
    output_prefix: str
) -> None
```
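Before merging, it can help to check which prefixes in a directory have both files of the pair. The sketch below is an assumption about how inputs might be discovered, modeled loosely on the upstream script, which pairs each .idx file with a .bin file of the same name; `find_complete_prefixes` is a hypothetical helper and not part of this module.

```python
import os
import tempfile


def find_complete_prefixes(input_dir: str) -> list[str]:
    """List path prefixes that have both a .bin and an .idx file (hypothetical)."""
    extensions_by_stem: dict[str, set[str]] = {}
    for name in os.listdir(input_dir):
        stem, ext = os.path.splitext(name)
        if ext in (".bin", ".idx"):
            extensions_by_stem.setdefault(stem, set()).add(ext)
    # Keep only stems for which both halves of the pair exist.
    return sorted(
        os.path.join(input_dir, stem)
        for stem, exts in extensions_by_stem.items()
        if exts == {".bin", ".idx"}
    )


with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("a.bin", "a.idx", "b.bin", "notes.txt"):
        open(os.path.join(tmp_dir, name), "wb").close()
    found = [os.path.basename(p) for p in find_complete_prefixes(tmp_dir)]

print(found)  # ['a']
```

Given the signature above, a merge would then be invoked as `merge_file_prefixes(input_dir, output_prefix)`, with both arguments passed as strings.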

```python
nemo_curator.utils.merge_file_prefixes._INDEX_HEADER = b'MMIDIDX\x00\x00'
```
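A byte-string constant like this typically serves as a magic number identifying the file format. The sketch below checks for it, assuming the header sits at the very start of the .idx file; `has_index_header` is a hypothetical helper, not part of this module.

```python
import os
import tempfile

INDEX_HEADER = b"MMIDIDX\x00\x00"  # same bytes as _INDEX_HEADER above


def has_index_header(idx_path: str) -> bool:
    """Return True if the file begins with the index header (hypothetical)."""
    with open(idx_path, "rb") as f:
        return f.read(len(INDEX_HEADER)) == INDEX_HEADER


with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "good.idx")
    bad_path = os.path.join(tmp_dir, "bad.idx")
    with open(good_path, "wb") as f:
        f.write(INDEX_HEADER + b"remaining index bytes")
    with open(bad_path, "wb") as f:
        f.write(b"not an index file")
    good_result = has_index_header(good_path)
    bad_result = has_index_header(bad_path)

print(good_result, bad_result)  # True False
```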

```python
nemo_curator.utils.merge_file_prefixes.args = get_args()
```