nemo_curator.utils.merge_file_prefixes
Simplified version of the tools/merge_datasets.py script from the Megatron-LM library (https://github.com/NVIDIA/Megatron-LM/blob/main/tools/merge_datasets.py).
API
Simplified version of the IndexedDatasetBuilder class from the Megatron-LM library.
Builder class for the IndexedDataset class
Parameters:
The path to the data (.bin) file
The dtype of the index file. Defaults to np.int32.
Add an entire IndexedDataset to the dataset
Parameters:
The index (.idx) and data (.bin) prefix
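As an illustration of what adding an entire dataset involves, here is a minimal sketch, not the library's implementation. It assumes the Megatron-LM convention that the document indices start with a leading 0, so incoming boundaries are shifted by the number of sequences already recorded and the incoming leading 0 is dropped. The helper name `append_dataset` and all of its parameters are hypothetical, and the payload bytes are placeholders.

```python
import shutil
import tempfile
from pathlib import Path


def append_dataset(data_file, sequence_lengths, document_indices,
                   new_bin_path, new_sequence_lengths, new_document_indices):
    """Hypothetical helper: append one dataset's bookkeeping and raw bytes."""
    # Incoming document boundaries are relative to the incoming dataset,
    # so shift them by the number of sequences already recorded.
    offset = len(sequence_lengths)
    sequence_lengths.extend(new_sequence_lengths)
    # Skip the incoming leading 0; the existing list already ends at `offset`.
    document_indices.extend(offset + i for i in new_document_indices[1:])
    # Append the raw .bin payload without loading it into memory.
    with open(new_bin_path, "rb") as src:
        shutil.copyfileobj(src, data_file)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        part = Path(tmp) / "part.bin"
        part.write_bytes(b"\x01\x02\x03")  # placeholder payload
        merged = Path(tmp) / "merged.bin"
        sequence_lengths = [4, 2]   # two sequences already merged
        document_indices = [0, 2]   # one document spanning both sequences
        with open(merged, "wb") as out:
            out.write(b"\xaa\xbb")  # placeholder for earlier data
            append_dataset(out, sequence_lengths, document_indices,
                           part, [3, 5, 1], [0, 1, 3])
        return merged.read_bytes(), sequence_lengths, document_indices
```

In this sketch the incoming boundaries `[0, 1, 3]` become `3` and `5` after the shift, so the merged boundaries stay monotonically increasing.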
Clean up and write the index (.idx) file
Parameters:
The path to the index file
Simplified version of the _IndexWriter class from the Megatron-LM library.
Object class to write the index (.idx) file
Parameters:
The path to the index file
The dtype of the index file
Enter the context introduced by the ‘with’ keyword
Returns: _IndexWriter
The instance
Exit the context introduced by the ‘with’ keyword
Parameters:
Exception type
Exception value
Exception traceback object
Returns: bool | None
Whether to silence the exception
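The enter/exit pair follows the standard Python context manager protocol. The sketch below shows that protocol with a hypothetical `TinyWriter` class; it is a stand-in for illustration, not the library's _IndexWriter, and it assumes the file is opened on entry and closed on exit.

```python
import os
import tempfile


class TinyWriter:
    """Hypothetical stand-in for a writer used in a 'with' block."""

    def __init__(self, path):
        self.path = path
        self.handle = None

    def __enter__(self):
        # Open the file on entry and return the instance itself.
        self.handle = open(self.path, "wb")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Always close the file; returning None (falsy) lets any
        # exception raised inside the 'with' block propagate.
        self.handle.close()
        return None


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with TinyWriter(path) as writer:
            writer.handle.write(b"abc")
        closed = writer.handle.closed
        with open(path, "rb") as f:
            content = f.read()
        return closed, content
    finally:
        os.remove(path)
```

Returning a truthy value from `__exit__` would silence the exception instead, which is what the `bool | None` return type above allows for.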
Build the sequence pointers from the sequence lengths and the dtype size
Parameters:
The length of each sequence
Returns: list[int]
The pointer to the beginning of each sequence
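The pointer computation can be sketched as a running byte offset: each sequence starts where the previous one ended, and each token occupies one dtype-sized item. The function name `sequence_pointers` and the `itemsize` parameter are assumptions made for this example; the library derives the item size from the dtype rather than taking it as an argument.

```python
def sequence_pointers(sequence_lengths, itemsize):
    """Return the byte offset at which each sequence begins.

    `itemsize` is the size in bytes of one token, for example 4 for a
    32-bit dtype or 2 for a 16-bit dtype (hypothetical parameter).
    """
    pointers = []
    offset = 0
    for length in sequence_lengths:
        pointers.append(offset)       # this sequence starts here
        offset += length * itemsize   # advance past its tokens
    return pointers


print(sequence_pointers([3, 5, 2], 4))  # [0, 12, 32]
print(sequence_pointers([3, 5, 2], 2))  # [0, 6, 16]
```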
Write the index (.idx) file
Parameters:
The length of each sequence
The sequence indices demarcating the end of each document
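To make the relationship between the two arguments concrete, the sketch below groups sequence lengths by document. It assumes the Megatron-LM convention that the document indices begin with 0, so that consecutive entries bracket the sequences of one document. The function name `split_documents` is hypothetical.

```python
def split_documents(sequence_lengths, document_indices):
    """Group sequence lengths by document using end-of-document indices."""
    documents = []
    # Each consecutive pair (start, end) brackets one document's sequences.
    for start, end in zip(document_indices[:-1], document_indices[1:]):
        documents.append(sequence_lengths[start:end])
    return documents


# Four sequences: the first document holds one, the second holds three.
print(split_documents([3, 5, 2, 7], [0, 1, 4]))  # [[3], [5, 2, 7]]
```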
Extract the index contents from the index file
Parameters:
The path to the index file
Returns: tuple[np.ndarray, np.ndarray, type[np.number]]
The sequence lengths, document indices and dtype of the index file