nemo_automodel.components.datasets.llm.hellaswag#

Module Contents#

Classes#

HellaSwag

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

API#

class nemo_automodel.components.datasets.llm.hellaswag.HellaSwag(
path_or_dataset,
tokenizer,
split='train',
num_samples_limit=None,
trust_remote_code=True,
pad_to_max_length=True,
)#

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

This class loads and preprocesses the HellaSwag dataset using a tokenizer and a custom preprocessing pipeline for language model fine-tuning. Each example consists of a context and several candidate endings, where the goal is to choose the most plausible continuation.

dataset#

The processed dataset ready for model training or evaluation.

Type:

Dataset

Initialization

Initialize the HellaSwag dataset wrapper.

Parameters:
  • path_or_dataset (str or Dataset) – Path to the dataset or a HuggingFace Dataset object.

  • tokenizer (PreTrainedTokenizer) – The tokenizer used to process text.

  • split (str, optional) – Dataset split to use (e.g., ‘train’, ‘validation’). Defaults to ‘train’.

  • num_samples_limit (int, optional) – Maximum number of samples to load. Defaults to None.

  • trust_remote_code (bool, optional) – Whether to trust remote code. Defaults to True.

  • pad_to_max_length (bool, optional) – Whether to pad sequences to max length in the dataset. If False, sequences will have variable lengths and padding will be handled by the collate function. Defaults to True.

Notes

If num_samples_limit is an integer, it limits the dataset size using slicing.
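
A minimal usage sketch follows. The gpt2 tokenizer checkpoint and the Rowan/hellaswag dataset path are illustrative assumptions, not requirements; any compatible PreTrainedTokenizer, local path, or preloaded Dataset object should work.

```python
from transformers import AutoTokenizer

from nemo_automodel.components.datasets.llm.hellaswag import HellaSwag

# Tokenizer and dataset source are placeholders for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Load the validation split, cap it at 1,000 examples, and defer padding
# to the collate function by disabling pad_to_max_length.
dataset = HellaSwag(
    path_or_dataset="Rowan/hellaswag",
    tokenizer=tokenizer,
    split="validation",
    num_samples_limit=1000,
    pad_to_max_length=False,
)
```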

get_context(examples)#

Extracts the context part of each example.

Parameters:

examples (dict) – A dictionary containing example data with a “ctx” key.

Returns:

List of context strings.

Return type:

list[str]
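
A brief sketch of the expected input, assuming the batched, columnar layout that Hugging Face map-style preprocessing typically uses (the exact shape of examples is an assumption here):

```python
# Hypothetical batch: "ctx" holds one context string per example.
examples = {
    "ctx": [
        "A man is sitting on a roof. He",
        "A woman is outside with a bucket and a dog. She",
    ],
}

contexts = dataset.get_context(examples)
# -> ["A man is sitting on a roof. He", "A woman is outside with a bucket and a dog. She"]
```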

get_target(examples)#

Extracts the correct ending based on the label.

Parameters:

examples (dict) – A dictionary with “endings” (list of strings) and “label” (index of correct ending).

Returns:

The gold target strings based on the label index.

Return type:

list[str]
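
A sketch under the same assumptions: the batch is columnar, "endings" lists the candidate continuations per example, and "label" indexes the correct one. The raw HellaSwag release stores labels as strings; an integer index is shown purely for illustration.

```python
# Hypothetical batch with one example and four candidate endings.
examples = {
    "endings": [
        ["falls off the roof.", "starts to sing.", "nails down a shingle.", "flies away."],
    ],
    "label": [2],  # index of the correct ending (stored as a string in the raw dataset)
}

targets = dataset.get_target(examples)
# -> ["nails down a shingle."]
```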

__getitem__(index)#

Get a processed example by index.

Parameters:

index (int) – Index of the example.

Returns:

A tokenized and preprocessed example.

Return type:

dict

__len__()#

Get the number of examples in the dataset.

Returns:

Length of the processed dataset.

Return type:

int
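
Because the class implements both __len__ and __getitem__, it behaves like a standard map-style PyTorch dataset. A short sketch; the keys of a returned example depend on the preprocessing pipeline and are not guaranteed here:

```python
from torch.utils.data import DataLoader

print(len(dataset))   # number of processed examples
sample = dataset[0]   # tokenized, preprocessed dict for the first example
print(sample.keys())  # exact keys depend on the preprocessing pipeline

# As a map-style dataset it plugs directly into a PyTorch DataLoader.
# When pad_to_max_length=False, a padding collate_fn must be supplied
# so variable-length sequences can be batched (not shown here).
loader = DataLoader(dataset, batch_size=8, shuffle=True)
```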