nemo_rl.data.datasets.processed_dataset#

Module Contents#

Classes#

AllTaskProcessedDataset

Dataset for processing single or multi-task data with task-specific tokenization and processing.

Data#

API#

nemo_rl.data.datasets.processed_dataset.TokenizerType#

None

class nemo_rl.data.datasets.processed_dataset.AllTaskProcessedDataset(
dataset: datasets.Dataset | Any,
tokenizer: nemo_rl.data.datasets.processed_dataset.TokenizerType,
default_task_data_spec: nemo_rl.data.interfaces.TaskDataSpec,
task_data_processors: dict[str, tuple[nemo_rl.data.interfaces.TaskDataSpec, nemo_rl.data.interfaces.TaskDataProcessFnCallable]] | nemo_rl.data.interfaces.TaskDataProcessFnCallable,
max_seq_length: Optional[int] = None,
)#

Dataset for processing single or multi-task data with task-specific tokenization and processing.

Parameters:
  • dataset – Input dataset containing raw data

  • tokenizer – Tokenizer for text processing

  • default_task_data_spec – Default task processing specifications. In the case of single-task, this is the spec used for processing all entries. In the case of multi-task, any values not specified in the task-specific specs will be taken from the default spec.

  • task_data_processors – Either a single TaskDataProcessFnCallable used for single-task processing, or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) tuples for multi-task processing

  • max_seq_length – Maximum sequence length for tokenized outputs

Initialization
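A hedged construction sketch for both modes (not from the source): `my_processor`, the record shape with a `"task_name"` key, and the `TaskDataSpec` constructor arguments shown are assumptions, not confirmed API details. Passing a single callable selects single-task mode; passing a dict of `(TaskDataSpec, TaskDataProcessFnCallable)` tuples selects multi-task mode.

```python
# Hedged sketch: constructing AllTaskProcessedDataset in both modes.
from datasets import Dataset
from transformers import AutoTokenizer

from nemo_rl.data.datasets.processed_dataset import AllTaskProcessedDataset
from nemo_rl.data.interfaces import TaskDataSpec

raw_dataset = Dataset.from_list(
    [{"prompt": "What is 2 + 2?", "task_name": "math"}]  # assumed record shape
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any HF tokenizer
default_spec = TaskDataSpec("default")  # positional task name is an assumption


def my_processor(datum_dict, task_data_spec, tokenizer, max_seq_length, idx):
    """Assumed TaskDataProcessFnCallable: raw record -> DatumSpec."""
    ...


# Single-task mode: one processor callable handles every entry.
single_task = AllTaskProcessedDataset(
    dataset=raw_dataset,
    tokenizer=tokenizer,
    default_task_data_spec=default_spec,
    task_data_processors=my_processor,
    max_seq_length=4096,
)

# Multi-task mode: a dict routes each task name to its (spec, processor)
# pair; fields unset in a task-specific spec fall back to default_spec.
multi_task = AllTaskProcessedDataset(
    dataset=raw_dataset,
    tokenizer=tokenizer,
    default_task_data_spec=default_spec,
    task_data_processors={"math": (TaskDataSpec("math"), my_processor)},
    max_seq_length=4096,
)
```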

__len__() → int#
encode_single(
text: Union[str, list[str]],
) → tuple[list[int] | torch.Tensor, int]#

Tokenizes either a single string or a list of strings representing multiple turns of the same conversation.

Returns a single (concatenated) list of token ids along with its length.
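A brief usage sketch, reusing the `single_task` instance from the construction example above; the multi-turn concatenation behavior follows the description here:

```python
# One string: tokenize a single turn.
ids, n = single_task.encode_single("What is 2 + 2?")
assert n == len(ids)

# List of strings: per-turn token ids are concatenated into one sequence.
ids, n = single_task.encode_single(["User: hi", "Assistant: hello"])
assert n == len(ids)
```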

__getitem__(idx: int) → nemo_rl.data.interfaces.DatumSpec#

Return a single processed example as a DatumSpec.
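A minimal indexing sketch, again reusing `single_task` from above; only indexing and `len()` are shown, since the DatumSpec field layout is not documented on this page:

```python
datum = single_task[0]     # processed DatumSpec for the first example
total = len(single_task)   # __len__: number of examples in the dataset
```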