nemo_rl.data.datasets.response_datasets.response_dataset#

Module Contents#

Classes#

ResponseDataset

Dataset class for response data which can be loaded from a JSON file.

API#

class nemo_rl.data.datasets.response_datasets.response_dataset.ResponseDataset(
data_path: str,
input_key: str = 'input',
output_key: str = 'output',
split: Optional[str] = None,
split_validation_size: float = 0,
seed: int = 42,
**kwargs,
)#

Bases: nemo_rl.data.datasets.raw_dataset.RawDataset

Dataset class for response data which can be loaded from a JSON file.

This class handles loading of response data for SFT and RL training. The input JSONL files should contain valid JSON objects formatted like this:

    {
        input_key: str,   # The input prompt/context
        output_key: str,  # The output response/answer
    }
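As a rough illustration of the record format described above, the following sketch writes a small JSONL file using the default keys "input" and "output", then checks that each line is a JSON object with string values under both keys. The helper names write_jsonl and check_jsonl and the sample records are hypothetical and are not part of nemo_rl. The check only reflects the format stated here, not whatever validation the library itself performs.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample records using the default keys "input" and "output".
records = [
    {"input": "What is 2 + 2?", "output": "4"},
    {"input": "Name a prime number.", "output": "7"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line (hypothetical helper, not part of nemo_rl)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def check_jsonl(path, input_key="input", output_key="output"):
    """Parse a JSONL file and verify each row has string values for both keys.

    Hypothetical helper: raises ValueError on the first row that does not match.
    """
    rows = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            for key in (input_key, output_key):
                if not isinstance(row.get(key), str):
                    raise ValueError(f"line {line_no}: missing or non-string {key!r}")
            rows.append(row)
    return rows


sample_path = Path(tempfile.mkdtemp()) / "sample.jsonl"
write_jsonl(sample_path, records)
checked = check_jsonl(sample_path)
```

If the file uses different field names, the same check can be run with input_key and output_key set to those names, mirroring the constructor parameters of the same names.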

Parameters:
  • data_path – Path to the dataset JSON file

  • input_key – Key for the input text, default is “input”

  • output_key – Key for the output text, default is “output”

  • split – Optional split name for the dataset, used for HuggingFace datasets

  • split_validation_size – Size of the validation data, default is 0

  • seed – Seed for train/validation split when split_validation_size > 0, default is 42
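To make the interaction between split_validation_size and seed more concrete, here is a minimal standalone sketch of a seeded train/validation split. It is not the library's implementation. The function split_rows is hypothetical, and it assumes split_validation_size is a fraction of the dataset, which the parameter description does not state explicitly.

```python
import random


def split_rows(rows, split_validation_size=0.0, seed=42):
    """Illustrative seeded split (hypothetical; NOT the nemo_rl implementation).

    Assumes split_validation_size is a fraction in [0, 1).
    Returns (train_rows, validation_rows).
    """
    if split_validation_size <= 0:
        # No validation split requested: everything stays in the training set.
        return list(rows), []
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * split_validation_size))
    return shuffled[n_val:], shuffled[:n_val]


rows = [{"input": f"q{i}", "output": f"a{i}"} for i in range(10)]
train, val = split_rows(rows, split_validation_size=0.2, seed=42)
train_again, val_again = split_rows(rows, split_validation_size=0.2, seed=42)
```

Under these assumptions, calling the function twice with the same seed yields the same partition, and the default split_validation_size of 0 leaves the validation set empty.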

Initialization

format_data(data: dict[str, Any]) → dict[str, Any]#