modules.dataset_ops#

Module Contents#

Classes#

Shuffle

Randomly permutes the dataset.

Functions#

blend_datasets

Combines multiple datasets into one with different amounts of each dataset.

default_filename

Maps a partition number to the default output filename, f'file_{partition_num:010d}.jsonl'.

API#

class modules.dataset_ops.Shuffle(
seed: int | None = None,
npartitions: int | None = None,
partition_to_filename: collections.abc.Callable[[int], str] = default_filename,
filename_col: str = 'file_name',
)#

Bases: nemo_curator.modules.base.BaseModule

Base class for all NeMo Curator modules.

Handles validating that data lives on the correct device for each module.

Initialization

Randomly permutes the dataset. This will make the original filename_col column invalid, so if the column is present it will be overwritten.

Args:
    seed: The random seed used to determine which partition (file) each datapoint goes to. Setting the seed guarantees determinism, but may be slightly slower (20-30% slower) depending on the dataset size.
    npartitions: The number of output partitions to create in the dataset. If None, the original number of partitions is retained.
    partition_to_filename: If the filename column is present, it will be overwritten. Passing a function through this argument lets the user configure what the filename will look like for a given partition number. The default names each partition f'file_{partition_num:010d}.jsonl' and should be changed if the user is not using the .jsonl format.
    filename_col: The name of the column that stores each document's output filename.
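
A minimal construction sketch, assuming the class is importable from nemo_curator.modules.dataset_ops; the Parquet naming function is hypothetical and stands in for any callable mapping a partition number to a filename:

from nemo_curator.modules.dataset_ops import Shuffle

# Deterministic shuffle: a fixed seed makes partition assignment
# reproducible at the cost of some speed (roughly 20-30% slower).
shuffle = Shuffle(seed=42, npartitions=16)

# Hypothetical naming function for Parquet output, replacing the
# default f"file_{partition_num:010d}.jsonl" pattern.
def parquet_filename(partition_num: int) -> str:
    return f"shard_{partition_num:05d}.parquet"

parquet_shuffle = Shuffle(seed=42, partition_to_filename=parquet_filename)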

call(
dataset: nemo_curator.datasets.doc_dataset.DocumentDataset,
) → nemo_curator.datasets.doc_dataset.DocumentDataset#

Performs an arbitrary operation on a dataset.

Args:
    dataset (DocumentDataset): The dataset to operate on.
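
A usage sketch. DocumentDataset.read_json and to_json are the library's usual reader and writer, but treat the exact paths and keyword arguments here as assumptions:

from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.dataset_ops import Shuffle

# Illustrative path; any JSONL directory readable by the library works.
dataset = DocumentDataset.read_json("input_dir/")

# Calling the module validates the data's device/backend, then shuffles.
shuffled = Shuffle(seed=42)(dataset)

# write_to_filename=True routes each document to the file named in filename_col.
shuffled.to_json("output_dir/", write_to_filename=True)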

shuffle_deterministic(
dataset: nemo_curator.datasets.doc_dataset.DocumentDataset,
) → nemo_curator.datasets.doc_dataset.DocumentDataset#
shuffle_nondeterministic(
dataset: nemo_curator.datasets.doc_dataset.DocumentDataset,
) → nemo_curator.datasets.doc_dataset.DocumentDataset#
modules.dataset_ops.blend_datasets(
target_size: int,
datasets: list[nemo_curator.datasets.doc_dataset.DocumentDataset],
sampling_weights: list[float],
) → nemo_curator.datasets.doc_dataset.DocumentDataset#

Combines multiple datasets into one with different amounts of each dataset.

Args:
    target_size: The number of documents the resulting dataset should have. The actual size of the dataset may be slightly larger if the normalized weights do not allow for even mixtures of the datasets.
    datasets: A list of all datasets to combine together.
    sampling_weights: A list of weights to assign to each dataset in the input. Weights will be normalized across the whole list as part of the sampling process. For example, if the normalized sampling weight for dataset 1 is 0.02, 2% of the total samples will be sampled from dataset 1. There are guaranteed to be math.ceil(normalized_weight_i * target_size) elements from dataset i in the final blend.
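
A blending sketch with two illustrative DocumentDataset objects, books and web. With weights [3.0, 1.0] normalized to 0.75 and 0.25 and target_size=1000, the result contains at least math.ceil(0.75 * 1000) = 750 documents from books and math.ceil(0.25 * 1000) = 250 from web, so the blend can slightly exceed the target:

from nemo_curator.modules.dataset_ops import blend_datasets

# books and web are DocumentDataset objects loaded elsewhere.
blended = blend_datasets(
    target_size=1000,
    datasets=[books, web],
    sampling_weights=[3.0, 1.0],  # normalized to [0.75, 0.25]
)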

modules.dataset_ops.default_filename(partition_num: int) → str#
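
Based on the naming pattern documented in Shuffle, the default mapping should behave as in this sketch:

from nemo_curator.modules.dataset_ops import default_filename

# Partition numbers are zero-padded to ten digits,
# following the f"file_{partition_num:010d}.jsonl" pattern.
print(default_filename(7))    # file_0000000007.jsonl
print(default_filename(123))  # file_0000000123.jsonl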