nemo_rl.data.datasets.processed_dataset#

Module Contents#

Classes#

AllTaskProcessedDataset

Dataset for processing single or multi-task data with task-specific tokenization and processing.

Data#

API#

nemo_rl.data.datasets.processed_dataset.TokenizerType#

None

class nemo_rl.data.datasets.processed_dataset.AllTaskProcessedDataset(
dataset: datasets.Dataset | Any,
tokenizer: nemo_rl.data.datasets.processed_dataset.TokenizerType,
default_task_data_spec: nemo_rl.data.interfaces.TaskDataSpec,
task_data_processors: dict[str, tuple[nemo_rl.data.interfaces.TaskDataSpec, nemo_rl.data.interfaces.TaskDataProcessFnCallable]] | nemo_rl.data.interfaces.TaskDataProcessFnCallable,
max_seq_length: Optional[int] = None,
)#

Dataset for processing single or multi-task data with task-specific tokenization and processing.

Parameters:
  • dataset – Input dataset containing raw data

  • tokenizer – Tokenizer for text processing

  • default_task_data_spec – Default task processing specifications. In the case of single-task, this is the spec used for processing all entries. In the case of multi-task, any values not specified in the task-specific specs will be taken from the default spec.

  • task_data_processors – Either a single TaskDataProcessFnCallable used for single-task processing, or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) tuples for multi-task processing

  • max_seq_length – Maximum sequence length for tokenized outputs

Initialization
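A hedged construction sketch for both modes (not from the source): `my_processor`, the record shape with a `"task_name"` key, and the `TaskDataSpec` constructor arguments shown are assumptions, not confirmed API details. Passing a single callable selects single-task mode; passing a dict of `(TaskDataSpec, TaskDataProcessFnCallable)` tuples selects multi-task mode.

```python
# Hedged sketch: constructing AllTaskProcessedDataset in both modes.
from datasets import Dataset
from transformers import AutoTokenizer

from nemo_rl.data.datasets.processed_dataset import AllTaskProcessedDataset
from nemo_rl.data.interfaces import TaskDataSpec

raw_dataset = Dataset.from_list(
    [{"prompt": "What is 2 + 2?", "task_name": "math"}]  # assumed record shape
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any HF tokenizer
default_spec = TaskDataSpec("default")  # positional task name is an assumption


def my_processor(datum_dict, task_data_spec, tokenizer, max_seq_length, idx):
    """Assumed TaskDataProcessFnCallable: raw record -> DatumSpec."""
    ...


# Single-task mode: one processor callable handles every entry.
single_task = AllTaskProcessedDataset(
    dataset=raw_dataset,
    tokenizer=tokenizer,
    default_task_data_spec=default_spec,
    task_data_processors=my_processor,
    max_seq_length=4096,
)

# Multi-task mode: a dict routes each task name to its (spec, processor)
# pair; fields unset in a task-specific spec fall back to default_spec.
multi_task = AllTaskProcessedDataset(
    dataset=raw_dataset,
    tokenizer=tokenizer,
    default_task_data_spec=default_spec,
    task_data_processors={"math": (TaskDataSpec("math"), my_processor)},
    max_seq_length=4096,
)
```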

__len__() → int#
encode_single(
text: Union[str, list[str]],
) → tuple[list[int] | torch.Tensor, int]#

Tokenizes either a single string or a list of strings representing multiple turns of the same conversation.

Returns a single (concatenated) list of token ids along with its length.
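A brief usage sketch, reusing the `single_task` instance from the construction example above; the multi-turn concatenation behavior follows the description here:

```python
# One string: tokenize a single turn.
ids, n = single_task.encode_single("What is 2 + 2?")
assert n == len(ids)

# List of strings: per-turn token ids are concatenated into one sequence.
ids, n = single_task.encode_single(["User: hi", "Assistant: hello"])
assert n == len(ids)
```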

__getitem__(idx: int) → nemo_rl.data.interfaces.DatumSpec#

Return a single processed example as a DatumSpec.
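A minimal indexing sketch, again reusing `single_task` from above; only indexing and `len()` are shown, since the DatumSpec field layout is not documented on this page:

```python
datum = single_task[0]     # processed DatumSpec for the first example
total = len(single_task)   # __len__: number of examples in the dataset
```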