nemo_rl.data.datasets.response_datasets.nemotron_cascade2_sft#

Module Contents#

Classes#

NemotronCascade2SFTMathDataset

Simple wrapper around the Nemotron-Cascade-2-SFT-Data math split.

API#

class nemo_rl.data.datasets.response_datasets.nemotron_cascade2_sft.NemotronCascade2SFTMathDataset(
split: str = 'train',
split_validation_size: float = 0.05,
seed: int = 42,
max_samples: int | None = None,
**kwargs,
)#

Bases: nemo_rl.data.datasets.raw_dataset.RawDataset

Simple wrapper around the Nemotron-Cascade-2-SFT-Data math split.

Loads the math subset of nvidia/Nemotron-Cascade-2-SFT-Data from HuggingFace. Each example already contains a messages field in OpenAI chat format (system / user / assistant turns), so no heavy reformatting is needed.

Parameters:
  • split – HuggingFace dataset split to load, default is “train”

  • split_validation_size – Fraction of data held out for validation when no dedicated validation split exists, default is 0.05

  • seed – Random seed used when shuffling before selecting max_samples and when creating the train/validation split, default is 42

  • max_samples – If set, randomly sample this many examples from the dataset before any train/validation split, default is None (use all)

Initialization

format_data(data: dict[str, Any]) dict[str, Any]#