nemo_rl.data.datasets.processed_dataset#
Module Contents#
Classes#
| AllTaskProcessedDataset | Dataset for processing single or multi-task data with task-specific tokenization and processing. |
Data#
API#
- nemo_rl.data.datasets.processed_dataset.TokenizerType#
None
- class nemo_rl.data.datasets.processed_dataset.AllTaskProcessedDataset(
  - dataset: datasets.Dataset | Any,
  - tokenizer: nemo_rl.data.datasets.processed_dataset.TokenizerType,
  - default_task_data_spec: nemo_rl.data.interfaces.TaskDataSpec,
  - task_data_processors: dict[str, tuple[nemo_rl.data.interfaces.TaskDataSpec, nemo_rl.data.interfaces.TaskDataProcessFnCallable]] | nemo_rl.data.interfaces.TaskDataProcessFnCallable,
  - max_seq_length: Optional[int] = None,
- )#
Dataset for processing single or multi-task data with task-specific tokenization and processing.
- Parameters:
  - dataset – Input dataset containing raw data.
  - tokenizer – Tokenizer for text processing.
  - default_task_data_spec – Default task processing specification. For single-task data, this spec is used to process all entries; for multi-task data, any values not set in a task-specific spec fall back to this default.
  - task_data_processors – Either a single TaskDataProcessFnCallable (single-task), or a dict mapping task names to (TaskDataSpec, TaskDataProcessFnCallable) pairs (multi-task).
  - max_seq_length – Maximum sequence length for tokenized outputs.
Initialization
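A minimal construction sketch, not taken from the library's own examples: it assumes a Hugging Face tokenizer, that TaskDataSpec can be built with just a task_name, that each row's "task_name" field is the multi-task routing key, and that the processor's call signature matches TaskDataProcessFnCallable. The processor body, the "math"/"code" task names, and `my_processor` are all illustrative.

```python
from datasets import Dataset
from transformers import AutoTokenizer

from nemo_rl.data.datasets.processed_dataset import AllTaskProcessedDataset
from nemo_rl.data.interfaces import TaskDataSpec

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any HF tokenizer

# Toy raw data; the "task_name" routing key is an assumption here.
raw = Dataset.from_list([
    {"task_name": "math", "prompt": "2 + 2 = ?"},
    {"task_name": "code", "prompt": "Reverse a string in Python."},
])

default_spec = TaskDataSpec(task_name="default")

def my_processor(datum_dict, task_data_spec, tokenizer, max_seq_length, idx):
    """Hypothetical processor; the real contract is TaskDataProcessFnCallable."""
    ...  # tokenize datum_dict["prompt"] and return a nemo_rl.data.interfaces.DatumSpec

# Single-task: one processor handles every row, using the default spec.
single_task_ds = AllTaskProcessedDataset(
    dataset=raw,
    tokenizer=tokenizer,
    default_task_data_spec=default_spec,
    task_data_processors=my_processor,
    max_seq_length=1024,
)

# Multi-task: route each row to its task's (spec, processor) pair;
# unset spec fields fall back to default_spec.
multi_task_ds = AllTaskProcessedDataset(
    dataset=raw,
    tokenizer=tokenizer,
    default_task_data_spec=default_spec,
    task_data_processors={
        "math": (TaskDataSpec(task_name="math"), my_processor),
        "code": (TaskDataSpec(task_name="code"), my_processor),
    },
    max_seq_length=1024,
)
```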
- __len__() → int#
- encode_single(
  - text: Union[str, list[str]],
- )#
Takes either a single string or a list of strings representing multiple turns of the same conversation.
Returns a single (concatenated) list of token ids and the length of that list.
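A short usage sketch of that contract, reusing the hypothetical `single_task_ds` built above; the return order (ids, then length) follows the description here.

```python
# Single string: one turn.
ids, length = single_task_ds.encode_single("Hello, world!")
assert length == len(ids)

# List of strings: multiple turns of one conversation, tokenized
# and concatenated into a single id list.
ids, length = single_task_ds.encode_single(["User: hi", "Assistant: hello!"])
assert length == len(ids)
```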
- __getitem__(idx: int) → nemo_rl.data.interfaces.DatumSpec#
Return a single processed example (DatumSpec) for the given index.
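Because the class implements __len__ and __getitem__, it behaves as a map-style dataset. The sketch below reuses the hypothetical `multi_task_ds` from above; wrapping it in a torch.utils.data.DataLoader with a collate function suited to DatumSpec entries is the usual next step.

```python
datum = multi_task_ds[0]  # a nemo_rl.data.interfaces.DatumSpec for row 0

for i in range(len(multi_task_ds)):
    datum = multi_task_ds[i]
    # inspect or batch the processed datum here
```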