nemo_rl.data.datasets.response_datasets.nemotron_cascade2_sft#
Module Contents#
Classes#
Simple wrapper around the Nemotron-Cascade-2-SFT-Data math split. |
API#
- class nemo_rl.data.datasets.response_datasets.nemotron_cascade2_sft.NemotronCascade2SFTMathDataset(
- split: str = 'train',
- split_validation_size: float = 0.05,
- seed: int = 42,
- max_samples: int | None = None,
- **kwargs,
Bases:
nemo_rl.data.datasets.raw_dataset.RawDatasetSimple wrapper around the Nemotron-Cascade-2-SFT-Data math split.
Loads the
mathsubset ofnvidia/Nemotron-Cascade-2-SFT-Datafrom HuggingFace. Each example already contains amessagesfield in OpenAI chat format (system / user / assistant turns), so no heavy reformatting is needed.- Parameters:
split – HuggingFace dataset split to load, default is “train”
split_validation_size – Fraction of data held out for validation when no dedicated validation split exists, default is 0.05
seed – Random seed used when shuffling before selecting max_samples and when creating the train/validation split, default is 42
max_samples – If set, randomly sample this many examples from the dataset before any train/validation split, default is None (use all)
Initialization
- format_data(data: dict[str, Any]) dict[str, Any]#