`nemo_automodel.datasets.llm.hellaswag`

Module Contents

Classes

HellaSwag

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

API

class nemo_automodel.datasets.llm.hellaswag.HellaSwag(path_or_dataset, tokenizer, split='train', num_samples_limit=None, trust_remote_code=True)

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

This class loads the HellaSwag dataset and preprocesses it with a tokenizer and a custom preprocessing pipeline for language model fine-tuning. Each example consists of a context and several candidate endings, where the goal is to choose the most plausible continuation.

Attributes:

dataset (Dataset) – The processed dataset ready for model training or evaluation.

Initialization

Initialize the HellaSwag dataset wrapper.

Parameters:

path_or_dataset (str or Dataset) – Path to the dataset or a Hugging Face Dataset object.
tokenizer (PreTrainedTokenizer) – The tokenizer used to process text.
split (str, optional) – Dataset split to use (e.g., 'train', 'validation'). Defaults to 'train'.
num_samples_limit (int, optional) – Maximum number of samples to load. Defaults to None (no limit).
trust_remote_code (bool, optional) – Whether to trust remote code when loading the dataset. Defaults to True.

Notes

If num_samples_limit is an integer, the dataset is truncated to at most that many samples using slicing.
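
The note above says only that the limit is applied by slicing. One way this could work is to fold the limit into the split string, since the Hugging Face datasets library accepts slice expressions such as train[:100]. The sketch below assumes that approach. The helper name build_split is hypothetical and is not part of nemo_automodel.

```python
from typing import Optional


def build_split(split: str = "train", num_samples_limit: Optional[int] = None) -> str:
    """Return a split expression, sliced when an integer limit is given.

    Hypothetical helper for illustration only; not part of nemo_automodel.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(num_samples_limit, int) and not isinstance(num_samples_limit, bool):
        return f"{split}[:{num_samples_limit}]"
    # No limit: use the split name unchanged.
    return split


print(build_split("train", 100))
print(build_split("validation"))
```

A string built this way could be passed as the split argument of datasets.load_dataset. Whether HellaSwag does this internally is not confirmed by this page.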

get_context(examples)

Extracts the context part of each example.

Parameters: examples (dict) – A dictionary containing example data with a "ctx" key.
Returns: List of context strings.
Return type: list[str]
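
As a rough illustration of the input shape, the sketch below assumes a column-oriented batch in which "ctx" maps to a list of strings, one per example. The standalone function and the sample sentences are made up for illustration and are not the library's code.

```python
from typing import Dict, List


def get_context(examples: Dict[str, List[str]]) -> List[str]:
    """Return the context string of every example in a batch (illustrative sketch)."""
    # Assumption: the batch is column-oriented, so "ctx" already holds one string per example.
    return list(examples["ctx"])


# Made-up batch with two examples.
batch = {"ctx": ["A man climbs onto a roof. He", "A woman picks up a knife. She"]}
contexts = get_context(batch)
print(len(contexts))
```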

get_target(examples)

Extracts the correct ending based on the label.

Parameters: examples (dict) – A dictionary with "endings" (list of strings) and "label" (index of correct ending).
Returns: The gold target strings based on the label index.
Return type: list[str]
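
The label-based selection described above can be sketched as follows. This standalone version assumes a batched input in which "endings" holds one list of candidate strings per example and "label" holds one index per example. It also assumes labels may be stored as strings, hence the int() conversion. The sample data is made up.

```python
from typing import Dict, List


def get_target(examples: Dict[str, List]) -> List[str]:
    """Pick the gold ending for each example in a batch (illustrative sketch)."""
    # int() accepts both integer labels and string labels such as "1".
    return [
        endings[int(label)]
        for endings, label in zip(examples["endings"], examples["label"])
    ]


# Made-up batch: two examples with three candidate endings each.
batch = {
    "endings": [
        ["starts to sing.", "begins replacing shingles.", "falls asleep."],
        ["chops an onion.", "rides a horse.", "paints a wall."],
    ],
    "label": ["1", 0],
}
targets = get_target(batch)
print(targets)
```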

__getitem__(index)

Get a processed example by index.

Parameters: index (int) – Index of the example.
Returns: A tokenized and preprocessed example.
Return type: dict

__len__()

Get the number of examples in the dataset.

Returns: Length of the processed dataset.
Return type: int
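
The indexing and length contract documented above can be mimicked with a small stand-in that delegates to a dataset attribute. The class name ProcessedDatasetWrapper and the field names input_ids and labels are assumptions made for this sketch. The actual keys produced by the preprocessing pipeline are not stated on this page.

```python
from typing import Any, Dict, List


class ProcessedDatasetWrapper:
    """Hypothetical stand-in mirroring the documented __getitem__/__len__ behavior."""

    def __init__(self, processed: List[Dict[str, Any]]) -> None:
        # Plays the role of the documented `dataset` attribute.
        self.dataset = processed

    def __getitem__(self, index: int) -> Dict[str, Any]:
        # Return one already-processed example as a dict.
        return self.dataset[index]

    def __len__(self) -> int:
        # Length of the processed dataset.
        return len(self.dataset)


# Made-up processed examples; the field names are assumptions.
wrapper = ProcessedDatasetWrapper([
    {"input_ids": [1, 2, 3], "labels": [-100, 2, 3]},
    {"input_ids": [4, 5], "labels": [-100, 5]},
])
print(len(wrapper), wrapper[1]["input_ids"])
```

Any object that exposes these two methods can be consumed as a map-style dataset, for example by a PyTorch DataLoader.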
