Important

You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.

Miscellaneous

class nemo_curator.Sequential(modules)

@nemo_curator.utils.decorators.batched

Marks a function as accepting a pandas Series of elements instead of a single element.

Parameters: function – The function that accepts a batch of elements
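The marking idea can be sketched without any dependencies. Everything in this block is a hypothetical stand-in rather than the library's code: `mark_batched` plays the role of the decorator, `apply_to_column` plays the role of a caller that checks the mark, and a plain list stands in for the pandas Series so the sketch runs on its own.

```python
from typing import Callable, List


def mark_batched(function: Callable) -> Callable:
    """Hypothetical stand-in: tag the function and return it unchanged."""
    function.batched = True
    return function


def apply_to_column(function: Callable, column: List[str]) -> List[int]:
    """Hypothetical caller: hand over the whole batch if the function is tagged,
    otherwise call it once per element."""
    if getattr(function, "batched", False):
        return list(function(column))
    return [function(element) for element in column]


def word_count(text: str) -> int:
    # Unmarked: receives a single element.
    return len(text.split())


@mark_batched
def word_count_batched(texts: List[str]) -> List[int]:
    # Marked: receives the whole batch (a list here, a pandas Series in the docs).
    return [len(text.split()) for text in texts]


docs = ["a b c", "hello world", ""]
per_element = apply_to_column(word_count, docs)
per_batch = apply_to_column(word_count_batched, docs)
```

Both calls give the same per-document counts. The only difference is whether the function is invoked once per element or once per batch.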

class nemo_curator.AddId(id_field, id_prefix: str = 'doc_id', start_index: int | None = None)

class nemo_curator.blend_datasets(target_size: int, datasets: List[DocumentDataset], sampling_weights: List[float])

Combines multiple datasets into one, with different amounts drawn from each dataset.

Parameters:

target_size – The number of documents the resulting dataset should have. The actual size of the dataset may be slightly larger if the normalized weights do not allow for even mixtures of the datasets.
datasets – A list of all datasets to combine together.
sampling_weights – A list of weights to assign to each dataset in the input. Weights will be normalized across the whole list as a part of the sampling process. For example, if the normalized sampling weight for dataset 1 is 0.02, 2% of the total samples will be sampled from dataset 1. There are guaranteed to be math.ceil(normalized_weight_i * target_size) elements from dataset i in the final blend.
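The arithmetic behind the weight normalization and the `math.ceil` guarantee can be worked through in plain Python. This is only a sketch of the stated formula, not the library's implementation, and the helper name `samples_per_dataset` is made up for illustration.

```python
import math
from typing import List


def samples_per_dataset(target_size: int, sampling_weights: List[float]) -> List[int]:
    """Apply math.ceil(normalized_weight_i * target_size) to each weight.

    Hypothetical helper that mirrors the formula quoted in the parameter
    description; it does no sampling itself.
    """
    total = sum(sampling_weights)
    normalized = [weight / total for weight in sampling_weights]
    return [math.ceil(weight * target_size) for weight in normalized]


# Three equally weighted datasets and a target of 100 documents.
counts = samples_per_dataset(100, [1.0, 1.0, 1.0])
blend_size = sum(counts)
```

Each normalized weight is 1/3, so every dataset contributes ceil(33.33...) = 34 documents and the blend holds 102 documents rather than 100. This is the case where the result is slightly larger than `target_size`.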

class nemo_curator.Shuffle(seed: int | None = None, npartitions: int | None = None, partition_to_filename: Callable[[int], str] = <function default_filename>)