Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.

Miscellaneous

class nemo_curator.Sequential(modules)

@nemo_curator.utils.decorators.batched

Marks a function as accepting a pandas series of elements instead of a single element.

Parameters: function – The function that accepts a batch of elements
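The idea of marking a function as batch-aware can be sketched without the library. Everything below is hypothetical: `mark_batched` and `apply` are illustrative stand-ins, the attribute flag is an assumed mechanism rather than the confirmed implementation of `batched`, and a plain list stands in for the pandas series so the sketch has no dependencies.

```python
def mark_batched(function):
    # Assumption: the marker is a simple attribute on the function.
    # The real nemo_curator decorator may record this differently.
    function.batched = True
    return function


def apply(function, elements):
    # Hypothetical caller: pass the whole batch to marked functions,
    # otherwise call the function once per element.
    if getattr(function, "batched", False):
        return list(function(elements))
    return [function(element) for element in elements]


@mark_batched
def lengths_batched(texts):
    # Receives every element at once (a list here, a pandas series in the library).
    return [len(text) for text in texts]


def length_single(text):
    # Receives one element per call.
    return len(text)


docs = ["a", "bcd", ""]
batched_result = apply(lengths_batched, docs)
single_result = apply(length_single, docs)
```

Both call paths produce the same values; the marker only changes whether the function sees one element or the whole batch.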

class nemo_curator.AddId(id_field, id_prefix: str = 'doc_id', start_index: Optional[int] = None)

class nemo_curator.blend_datasets(target_size: int, datasets: List[nemo_curator.datasets.doc_dataset.DocumentDataset], sampling_weights: List[float])

Combines multiple datasets into one, drawing a different amount from each dataset.

Parameters

target_size – The number of documents the resulting dataset should have. The actual size of the dataset may be slightly larger if the normalized weights do not allow for even mixtures of the datasets.
datasets – A list of all datasets to combine together.
sampling_weights – A list of weights to assign to each dataset in the input. Weights will be normalized across the whole list as a part of the sampling process. For example, if the normalized sampling weight for dataset 1 is 0.02, 2% of the total samples will be sampled from dataset 1. There are guaranteed to be math.ceil(normalized_weight_i * target_size) elements from dataset i in the final blend.
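The ceiling rule above explains why the blended dataset can exceed target_size. The following is a minimal arithmetic sketch of that rule only; it does not call NeMo Curator, and `blend_counts` is a hypothetical helper name, not part of the library.

```python
import math


def blend_counts(target_size, sampling_weights):
    # Normalize the weights so they sum to 1.
    total = sum(sampling_weights)
    normalized = [weight / total for weight in sampling_weights]
    # Per-dataset count stated in the docs: ceil(normalized_weight_i * target_size).
    return [math.ceil(weight * target_size) for weight in normalized]


# Weights that divide the target evenly: the blend matches target_size.
even_counts = blend_counts(10, [2.0, 2.0])

# Weights that do not divide evenly: each count is rounded up,
# so the total is slightly larger than target_size.
uneven_counts = blend_counts(10, [3.0, 1.0])
```

With weights 3.0 and 1.0 and a target of 10, the normalized weights are 0.75 and 0.25, giving counts of 8 and 3, so the blend holds 11 documents rather than 10.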

class nemo_curator.Shuffle(seed: typing.Optional[int] = None, npartitions: typing.Optional[int] = None, partition_to_filename: typing.Callable[[int], str] = <function default_filename>)