nemo_curator.utils.merge_file_prefixes

Simplified version of the tools/merge_datasets.py script from the Megatron-LM library (https://github.com/NVIDIA/Megatron-LM/blob/main/tools/merge_datasets.py).

Module Contents

Classes

| Name | Description |
| --- | --- |
| IndexedDatasetBuilder | Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library. |
| _IndexWriter | Simplified version of the _IndexWriter class from the Megatron-LM library. |

Functions

| Name | Description |
| --- | --- |
| extract_index_contents | Extract the index contents from the index file |
| get_args | - |
| merge_file_prefixes | - |

Data

_INDEX_HEADER

args

API

class nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder(
bin_path: str,
dtype: type[numpy.number]
)

Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library.

Builder class for the IndexedDataset class

Parameters:

bin_path
str

The path to the data (.bin) file

dtype
Type[np.number]

The dtype of the index file. The docstring gives a default of np.int32, but the signature above declares no default, so pass it explicitly.

data_file = open(bin_path, 'wb')

document_indices = [0]

sequence_lengths = []

nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder.add_index(
path_prefix: str
) -> None

Add an entire IndexedDataset to the dataset

Parameters:

path_prefix
str

The index (.idx) and data (.bin) prefix
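
As a rough illustration of what adding an entire dataset means for the builder's bookkeeping, the sketch below merges a second dataset's metadata into running `sequence_lengths` and `document_indices` lists. It assumes the offsetting scheme used by the Megatron-LM builder: incoming document indices are shifted by the number of sequences already present, and their leading 0 is dropped. `merge_index_state` is a hypothetical helper, not part of this module, and copying the .bin payload is omitted.

```python
def merge_index_state(
    sequence_lengths: list[int],
    document_indices: list[int],
    new_sequence_lengths: list[int],
    new_document_indices: list[int],
) -> None:
    """Append one dataset's index bookkeeping to the running lists in place.

    Hypothetical helper: assumes document indices are sequence offsets that
    start at 0, as the initial state ``document_indices = [0]`` suggests.
    """
    # Sequences already merged shift every incoming document boundary.
    offset = len(sequence_lengths)
    sequence_lengths.extend(new_sequence_lengths)
    # Skip the incoming leading 0: that boundary is already recorded.
    document_indices.extend(offset + i for i in new_document_indices[1:])


# Start from the builder's initial state.
sequence_lengths: list[int] = []
document_indices: list[int] = [0]

# First dataset: two documents holding 2 and 1 sequences.
merge_index_state(sequence_lengths, document_indices, [5, 3, 7], [0, 2, 3])
# Second dataset: one document holding 2 sequences.
merge_index_state(sequence_lengths, document_indices, [4, 6], [0, 2])

print(sequence_lengths)  # [5, 3, 7, 4, 6]
print(document_indices)  # [0, 2, 3, 5]
```

The second dataset's boundary at 2 becomes 5 because three sequences were already present when it was added.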

nemo_curator.utils.merge_file_prefixes.IndexedDatasetBuilder.finalize(
idx_path: str
) -> None

Clean up and write the index (.idx) file

Parameters:

idx_path
str

The path to the index file

class nemo_curator.utils.merge_file_prefixes._IndexWriter(
idx_path: str,
dtype: type[numpy.number]
)

Simplified version of the _IndexWriter class from the Megatron-LM library.

Object class to write the index (.idx) file

Parameters:

idx_path
str

The path to the index file

dtype
Type[np.number]

The dtype of the index file

nemo_curator.utils.merge_file_prefixes._IndexWriter.__enter__() -> nemo_curator.utils.merge_file_prefixes._IndexWriter

Enter the context introduced by the ‘with’ keyword

Returns: _IndexWriter

The instance

nemo_curator.utils.merge_file_prefixes._IndexWriter.__exit__(
exc_type: type[BaseException] | None,
exc_val: BaseException | None,
exc_tb: types.TracebackType | None
) -> bool | None

Exit the context introduced by the ‘with’ keyword

Parameters:

exc_type
Optional[Type[BaseException]]

Exception type

exc_val
Optional[BaseException]

Exception value

exc_tb
Optional[TracebackType]

Exception traceback object

Returns: bool | None

Optional[bool]: Whether to silence the exception
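
The `__enter__` and `__exit__` methods follow Python's standard context-manager protocol. The sketch below shows that pattern with a hypothetical `ToyIndexWriter` stand-in rather than the real class. It assumes the writer opens its file on entry and closes it on exit, which this page does not confirm.

```python
import os
import tempfile
from types import TracebackType
from typing import Optional, Type


class ToyIndexWriter:
    """Hypothetical stand-in that demonstrates the with-statement protocol only."""

    def __init__(self, idx_path: str) -> None:
        self.idx_path = idx_path
        self.idx_file = None

    def __enter__(self) -> "ToyIndexWriter":
        # Open the file and hand the instance to the `as` target.
        self.idx_file = open(self.idx_path, "wb")
        return self

    def __exit__(
        self,
        exc_type: Optional[Type[BaseException]],
        exc_val: Optional[BaseException],
        exc_tb: Optional[TracebackType],
    ) -> Optional[bool]:
        # Always close the file; returning None lets any exception propagate.
        self.idx_file.close()
        return None


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.idx")
    with ToyIndexWriter(path) as writer:
        writer.idx_file.write(b"payload")
    closed_after_exit = writer.idx_file.closed
    bytes_written = os.path.getsize(path)

print(closed_after_exit, bytes_written)  # True 7
```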

nemo_curator.utils.merge_file_prefixes._IndexWriter._sequence_pointers(
sequence_lengths: collections.abc.Iterable[int | numpy.integer]
) -> list[int]

Build the sequence pointers per the sequence lengths and dtype size

Parameters:

sequence_lengths
List[int]

The length of each sequence

Returns: list[int]

List[int]: The pointer to the beginning of each sequence
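
A pointer computation of this kind can be sketched as an exclusive running sum of sequence sizes in bytes. The free function below is a hypothetical illustration, not the method itself. It assumes sequences are stored back to back and that `itemsize` is the byte width of the dtype.

```python
def sequence_pointers(sequence_lengths: list[int], itemsize: int) -> list[int]:
    """Return the byte offset at which each sequence starts.

    Hypothetical free-function version; assumes contiguous storage where
    every token occupies ``itemsize`` bytes.
    """
    pointers = []
    current = 0
    for length in sequence_lengths:
        pointers.append(current)           # this sequence starts here
        current += int(length) * itemsize  # advance past its tokens
    return pointers


# Three sequences of 5, 3 and 7 tokens stored as 4-byte integers.
pointers = sequence_pointers([5, 3, 7], itemsize=4)
print(pointers)  # [0, 20, 32]
```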

nemo_curator.utils.merge_file_prefixes._IndexWriter.write(
sequence_lengths: collections.abc.Iterable[int | numpy.integer],
document_indices: collections.abc.Iterable[int | numpy.integer]
) -> None

Write the index (.idx) file

Parameters:

sequence_lengths
List[int]

The length of each sequence

document_indices
List[int]

The sequence indices demarcating the end of each document
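
To make the role of the two arguments concrete, the sketch below serializes them into an in-memory buffer. The field order (magic bytes, version, dtype code, counts, int32 lengths, int64 pointers, int64 document indices) is an assumption modeled on the Megatron-LM index format and is not stated on this page. `pack_index`, `dtype_code` and `itemsize` are hypothetical names.

```python
import io
import struct

INDEX_HEADER = b"MMIDIDX\x00\x00"  # same bytes as the documented _INDEX_HEADER


def pack_index(sequence_lengths, document_indices, dtype_code, itemsize):
    """Serialize index metadata to bytes using an ASSUMED field order."""
    # Byte offset of each sequence in the .bin file (exclusive running sum).
    pointers, current = [], 0
    for length in sequence_lengths:
        pointers.append(current)
        current += length * itemsize

    buf = io.BytesIO()
    buf.write(INDEX_HEADER)                              # magic bytes
    buf.write(struct.pack("<Q", 1))                      # version (assumed)
    buf.write(struct.pack("<B", dtype_code))             # dtype code (assumed 1 byte)
    buf.write(struct.pack("<Q", len(sequence_lengths)))  # sequence count
    buf.write(struct.pack("<Q", len(document_indices)))  # document index count
    buf.write(struct.pack(f"<{len(sequence_lengths)}i", *sequence_lengths))  # int32
    buf.write(struct.pack(f"<{len(pointers)}q", *pointers))                  # int64
    buf.write(struct.pack(f"<{len(document_indices)}q", *document_indices))  # int64
    return buf.getvalue()


# dtype_code=4 and itemsize=4 are illustrative values for a 4-byte integer type.
blob = pack_index([5, 3, 7], [0, 2, 3], dtype_code=4, itemsize=4)
print(len(blob))  # 94
```

The 94 bytes break down as a 34-byte preamble, 12 bytes of lengths, 24 bytes of pointers and 24 bytes of document indices.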

nemo_curator.utils.merge_file_prefixes.extract_index_contents(
idx_path: str
) -> tuple[numpy.ndarray, numpy.ndarray, type[numpy.number]]

Extract the index contents from the index file

Parameters:

idx_path
str

The path to the index file

Returns: tuple[np.ndarray, np.ndarray, type[np.number]]

Tuple[np.ndarray, np.ndarray, Type[np.number]]: The sequence lengths, document indices and dtype of the index file
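
Before reading any contents, a reader would typically confirm that the file begins with the magic bytes this module documents as `_INDEX_HEADER`. The sketch below shows such a check on in-memory streams. `has_index_header` is a hypothetical helper, and whether `extract_index_contents` performs the same validation is an assumption.

```python
import io

INDEX_HEADER = b"MMIDIDX\x00\x00"  # value documented for _INDEX_HEADER


def has_index_header(stream) -> bool:
    """Return True if the binary stream starts with the magic bytes."""
    return stream.read(len(INDEX_HEADER)) == INDEX_HEADER


good = has_index_header(io.BytesIO(INDEX_HEADER + b"\x01rest"))
bad = has_index_header(io.BytesIO(b"not an index file"))
print(good, bad)  # True False
```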

nemo_curator.utils.merge_file_prefixes.get_args() -> argparse.Namespace

nemo_curator.utils.merge_file_prefixes.merge_file_prefixes(
input_dir: str,
output_prefix: str
) -> None

nemo_curator.utils.merge_file_prefixes._INDEX_HEADER = b'MMIDIDX\x00\x00'

nemo_curator.utils.merge_file_prefixes.args = get_args()