nemo_automodel.datasets.llm.hellaswag#
Module Contents#
Classes#
HellaSwag | A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).
API#
- class nemo_automodel.datasets.llm.hellaswag.HellaSwag(
- path_or_dataset,
- tokenizer,
- split='train',
- num_samples_limit=None,
- trust_remote_code=True,
- )
A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).
This class loads and preprocesses the HellaSwag dataset using a tokenizer and a custom preprocessing pipeline for language model fine-tuning. Each example consists of a context and multiple-choice endings, where the goal is to choose the most plausible continuation.
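For orientation, a raw record exposes the fields used by the methods below (ctx, endings, label); a minimal sketch with illustrative values:

```python
# Illustrative shape of one raw HellaSwag record (values are examples only;
# "label" may be stored as a string index in some dataset releases).
record = {
    "ctx": "A man is sitting on a roof. He",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles off.",
        "is holding a rubik's cube.",
        "starts pulling up roofing on a roof.",
    ],
    "label": "3",  # index of the most plausible continuation
}
```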
- dataset#
The processed dataset ready for model training or evaluation.
- Type:
Dataset
Initialization
Initialize the HellaSwag dataset wrapper.
- Parameters:
path_or_dataset (str or Dataset) – Path to the dataset or a HuggingFace Dataset object.
tokenizer (PreTrainedTokenizer) – The tokenizer used to process text.
split (str, optional) – Dataset split to use (e.g., ‘train’, ‘validation’). Defaults to ‘train’.
num_samples_limit (int, optional) – Maximum number of samples to load. Defaults to None.
trust_remote_code (bool, optional) – Whether to trust remote code. Defaults to True.
Notes
If num_samples_limit is an integer, it limits the dataset size using slicing.
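A minimal usage sketch; the HuggingFace dataset ID ("hellaswag") and tokenizer choice ("gpt2") are illustrative assumptions, not requirements of this class:

```python
from transformers import AutoTokenizer

from nemo_automodel.datasets.llm.hellaswag import HellaSwag

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

# Load the validation split, capped at 512 samples via slicing (see Notes).
ds = HellaSwag(
    path_or_dataset="hellaswag",
    tokenizer=tokenizer,
    split="validation",
    num_samples_limit=512,
)
```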
- get_context(examples)[source]#
Extracts the context part of each example.
- Parameters:
examples (dict) – A dictionary containing example data with a “ctx” key.
- Returns:
List of context strings.
- Return type:
list[str]
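Continuing the sketch above, get_context reads the "ctx" column of a batched dict (the shape passed by datasets.Dataset.map with batched=True, an assumption inferred from the documented list return type):

```python
# Toy batched input; only the "ctx" column is read by get_context.
examples = {
    "ctx": [
        "A chef lays out dough on the counter and",
        "Two dogs run across a field and",
    ],
}
contexts = ds.get_context(examples)
# -> ["A chef lays out dough on the counter and",
#     "Two dogs run across a field and"]
```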
- get_target(examples)[source]#
Extracts the correct ending based on the label.
- Parameters:
examples (dict) – A dictionary with “endings” (list of strings) and “label” (index of correct ending).
- Returns:
The gold target strings based on the label index.
- Return type:
list[str]
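Continuing the sketch above; the batched shape and integer labels are assumptions inferred from the documented parameters and the list[str] return type:

```python
# Toy batched input: one example with three candidate endings.
examples = {
    "endings": [
        ["keeps kneading it.", "throws it out the window.", "paints the wall."],
    ],
    "label": [0],  # some dataset releases store labels as strings, e.g. "0"
}
targets = ds.get_target(examples)
# -> ["keeps kneading it."]
```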