Miscellaneous

class nemo_curator.Sequential(modules)

@nemo_curator.utils.decorators.batched

Marks a function as accepting a pandas Series of elements instead of a single element.

Parameters:

function – The function that accepts a batch of elements
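As a rough illustration of what "marking" a function can mean, the sketch below tags a function with an attribute that a caller inspects to decide whether to pass the whole batch or one element at a time. The names `mark_batched`, `apply`, and the `batched` attribute are hypothetical, and a plain list stands in for the pandas Series; this is an assumption-laden sketch, not NeMo Curator's implementation.

```python
from functools import wraps


def mark_batched(function):
    """Hypothetical stand-in for a 'batched' marker decorator."""

    @wraps(function)
    def wrapper(*args, **kwargs):
        return function(*args, **kwargs)

    # The flag is the only thing the decorator adds; behavior is unchanged.
    wrapper.batched = True
    return wrapper


def apply(function, elements):
    """Call `function` once on the whole batch if it is marked, else per element."""
    if getattr(function, "batched", False):
        return list(function(elements))
    return [function(element) for element in elements]


@mark_batched
def lengths_batched(texts):
    # Receives the whole batch at once.
    return [len(text) for text in texts]


def length_single(text):
    # Receives one element at a time.
    return len(text)
```

Both `apply(lengths_batched, texts)` and `apply(length_single, texts)` yield the same per-element results; the marker only changes how the function is invoked.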

class nemo_curator.AddId(
id_field,
id_prefix: str = 'doc_id',
start_index: int | None = None,
)
call(
dataset: DocumentDataset,
) → DocumentDataset

Performs an arbitrary operation on a dataset.

Parameters:

dataset (DocumentDataset) – The dataset to operate on

class nemo_curator.blend_datasets(
target_size: int,
datasets: List[DocumentDataset],
sampling_weights: List[float],
)

Combines multiple datasets into one, drawing a different amount from each dataset.

Parameters:
  • target_size – The number of documents the resulting dataset should have. The actual size of the dataset may be slightly larger if the normalized weights do not allow for even mixtures of the datasets.

  • datasets – A list of all datasets to combine together

  • sampling_weights – A list of weights to assign to each dataset in the input. Weights will be normalized across the whole list as a part of the sampling process. For example, if the normalized sampling weight for dataset 1 is 0.02, 2% of the total samples will be sampled from dataset 1. There are guaranteed to be math.ceil(normalized_weight_i * target_size) elements from dataset i in the final blend.
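The per-dataset guarantee can be worked through with plain arithmetic. The sketch below only reproduces the normalization and the math.ceil formula quoted above; `blend_counts` is a hypothetical helper, not part of nemo_curator, and it does not perform any sampling.

```python
import math


def blend_counts(target_size, sampling_weights):
    """Return the guaranteed number of documents drawn from each dataset."""
    total = sum(sampling_weights)
    # Normalize so the weights sum to 1 across the whole list.
    normalized = [weight / total for weight in sampling_weights]
    return [math.ceil(weight * target_size) for weight in normalized]


counts = blend_counts(target_size=10, sampling_weights=[1.0, 1.0, 1.0])
print(counts)       # [4, 4, 4]
print(sum(counts))  # 12, slightly larger than the requested 10
```

Three equal weights cannot split 10 documents evenly, so each count is rounded up and the total exceeds target_size, which is the case the description warns about.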

class nemo_curator.Shuffle(
seed: int | None = None,
npartitions: int | None = None,
partition_to_filename: typing.Callable[[int], str] = <function default_filename>,
filename_col: str = 'file_name',
)
call(
dataset: DocumentDataset,
) → DocumentDataset

Performs an arbitrary operation on a dataset.

Parameters:

dataset (DocumentDataset) – The dataset to operate on

class nemo_curator.DocumentSplitter(
separator: str,
text_field: str = 'text',
segment_id_field: str = 'segment_id',
)

Splits documents into segments based on a separator. Each segment is a new document with an additional column indicating the segment id.

To restore the original document, ensure that each document has a unique id prior to splitting.

call(
dataset: DocumentDataset,
) → DocumentDataset

Splits the documents into segments based on the separator and adds a column indicating the segment id.
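The splitting behavior described above can be sketched on plain Python dictionaries. This is an illustration under stated assumptions, not the library's code: `split_documents` is a hypothetical helper, lists of dicts stand in for a DocumentDataset, and numbering segments from 0 is an assumption.

```python
def split_documents(documents, separator, text_field="text",
                    segment_id_field="segment_id"):
    """Split each document dict into one dict per segment, copying other fields."""
    segments = []
    for document in documents:
        pieces = document[text_field].split(separator)
        for segment_id, piece in enumerate(pieces):
            segment = dict(document)  # carry over the id and any other fields
            segment[text_field] = piece
            segment[segment_id_field] = segment_id  # assumed to start at 0
            segments.append(segment)
    return segments


docs = [{"id": "doc-0", "text": "first|second"}, {"id": "doc-1", "text": "only"}]
rows = split_documents(docs, separator="|")
for row in rows:
    print(row)
```

Because every segment keeps the original `id`, the segments of one document can later be grouped again, which is why a unique id is needed before splitting.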

class nemo_curator.DocumentJoiner(
separator: str,
text_field: str = 'text',
segment_id_field: str = 'segment_id',
document_id_field: str = 'id',
drop_segment_id_field: bool = True,
max_length: int | None = None,
length_field: str | None = None,
)

Joins documents that have a common id back into a single document. The order of the documents is dictated by an additional segment_id column. A maximum length can be specified to limit the size of the joined documents.

The segments are concatenated using the separator.

call(
dataset: DocumentDataset,
) → DocumentDataset

Joins the documents back into a single document while preserving all the original fields.
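A minimal sketch of the joining step on plain Python dictionaries follows. `join_documents` is a hypothetical helper, not the library's implementation: it groups segments by the document id, orders them by segment id, and concatenates the text with the separator. It always drops the segment id (as with drop_segment_id_field=True) and deliberately ignores max_length and length_field.

```python
from collections import defaultdict


def join_documents(segments, separator, text_field="text",
                   segment_id_field="segment_id", document_id_field="id"):
    """Group segments by document id and concatenate them in segment order."""
    grouped = defaultdict(list)
    for segment in segments:
        grouped[segment[document_id_field]].append(segment)

    joined = []
    for parts in grouped.values():
        parts.sort(key=lambda part: part[segment_id_field])
        document = dict(parts[0])  # keep the other fields from the first segment
        document[text_field] = separator.join(part[text_field] for part in parts)
        del document[segment_id_field]  # the segment id is no longer meaningful
        joined.append(document)
    return joined


pieces = [
    {"id": "doc-0", "text": "second", "segment_id": 1},
    {"id": "doc-1", "text": "only", "segment_id": 0},
    {"id": "doc-0", "text": "first", "segment_id": 0},
]
restored = join_documents(pieces, separator="|")
print(restored)
```

The segments arrive out of order here on purpose: the segment id, not the input order, decides how the text is reassembled.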