nemo_automodel.datasets.llm.hellaswag

Module Contents

Classes

HellaSwag

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

API

class nemo_automodel.datasets.llm.hellaswag.HellaSwag(
path_or_dataset,
tokenizer,
split='train',
num_samples_limit=None,
trust_remote_code=True,
)

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

This class loads the HellaSwag dataset and preprocesses it with a tokenizer and a custom preprocessing pipeline for language model fine-tuning. Each example consists of a context and several candidate endings, and the benchmark task is to choose the most plausible continuation. For SFT, the context (see get_context) is paired with the correct ending (see get_target).

Attributes:
  • dataset (Dataset) – The processed dataset ready for model training or evaluation.

Initialization

Initialize the HellaSwag dataset wrapper.

Parameters:
  • path_or_dataset (str or Dataset) – Path to the dataset or a HuggingFace Dataset object.

  • tokenizer (PreTrainedTokenizer) – The tokenizer used to process text.

  • split (str, optional) – Dataset split to use (e.g., ‘train’, ‘validation’). Defaults to ‘train’.

  • num_samples_limit (int, optional) – Maximum number of samples to load. Defaults to None.

  • trust_remote_code (bool, optional) – Whether to trust remote code when loading the dataset. Defaults to True.

Notes

If num_samples_limit is an integer, the dataset is truncated to at most that many samples by slicing. If it is None, the full split is used.
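As an illustration of the note above: Hugging Face `datasets.load_dataset` accepts split strings with slice syntax, such as `train[:100]`. The sketch below builds such a string. The helper name `limited_split` is hypothetical, and whether HellaSwag applies the limit in exactly this way internally is an assumption.

```python
from typing import Optional


def limited_split(split: str, num_samples_limit: Optional[int] = None) -> str:
    """Hypothetical helper: append a slice to the split name when a limit is given."""
    if isinstance(num_samples_limit, int):
        # "train" with a limit of 100 becomes "train[:100]".
        return f"{split}[:{num_samples_limit}]"
    # No limit: use the split name unchanged.
    return split


limited = limited_split("train", 100)
unlimited = limited_split("validation")
print(limited)
print(unlimited)
```

A string produced this way could be passed as the `split` argument of `datasets.load_dataset`, so that only the first samples of the split are loaded.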

get_context(examples)

Extracts the context part of each example.

Parameters:

examples (dict) – A dictionary containing example data with a “ctx” key.

Returns:

List of context strings.

Return type:

list[str]
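A minimal sketch of the behavior described above, assuming `examples` is a batch in the column-oriented form used by Hugging Face `Dataset.map(batched=True)`, where each key maps to a list of values. The function name `get_context_sketch` and the sample sentences are illustrative only, not the library's implementation.

```python
def get_context_sketch(examples: dict) -> list:
    """Return the context string of every example in the batch."""
    # In a batched dict, "ctx" already holds one string per example.
    return list(examples["ctx"])


# Illustrative batch with two examples (made-up text).
batch = {
    "ctx": [
        "A man is sitting on a roof. He",
        "A woman fills a bucket with water. She",
    ],
}

contexts = get_context_sketch(batch)
print(contexts)
```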

get_target(examples)

Extracts the correct ending based on the label.

Parameters:

examples (dict) – A dictionary with “endings” (list of strings) and “label” (index of correct ending).

Returns:

The gold ending for each example, selected from its “endings” by the label index.

Return type:

list[str]
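The selection described above can be sketched as follows, again assuming a column-oriented batch. The label is converted with `int()` because labels may arrive as strings rather than integers; that conversion, the function name `get_target_sketch`, and the sample data are assumptions made for illustration.

```python
def get_target_sketch(examples: dict) -> list:
    """Pick the correct ending for every example using its label index."""
    return [
        endings[int(label)]  # int() tolerates labels stored as "0".."3"
        for endings, label in zip(examples["endings"], examples["label"])
    ]


# Illustrative batch: two examples, four candidate endings each (made-up text).
batch = {
    "endings": [
        ["starts to sing.", "begins removing shingles.", "falls asleep.", "eats lunch."],
        ["pours it on the plants.", "reads a book.", "rides a bike.", "paints a wall."],
    ],
    "label": ["1", 0],  # one string label and one integer label
}

targets = get_target_sketch(batch)
print(targets)
```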

__getitem__(index)

Get a processed example by index.

Parameters:

index (int) – Index of the example.

Returns:

A tokenized and preprocessed example.

Return type:

dict

__len__()

Get the number of examples in the dataset.

Returns:

Length of the processed dataset.

Return type:

int
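To show how `__getitem__` and `__len__` relate to the `dataset` attribute, here is a minimal sketch of a wrapper that delegates both to an already processed dataset. The class name `DatasetWrapperSketch` is hypothetical, and the keys `input_ids` and `labels` in the sample data are assumptions about what a tokenized example might contain, not a statement of the actual output format.

```python
class DatasetWrapperSketch:
    """Hypothetical stand-in that mirrors the indexing and length protocol."""

    def __init__(self, processed):
        # Plays the role of the processed `dataset` attribute.
        self.dataset = processed

    def __getitem__(self, index: int) -> dict:
        # Return one preprocessed example.
        return self.dataset[index]

    def __len__(self) -> int:
        # Number of examples in the processed dataset.
        return len(self.dataset)


# Made-up tokenized examples standing in for real preprocessing output.
processed = [
    {"input_ids": [1, 5, 9], "labels": [-100, 5, 9]},
    {"input_ids": [1, 7], "labels": [-100, 7]},
]

wrapper = DatasetWrapperSketch(processed)
print(len(wrapper))
print(wrapper[0])
```

Because the wrapper supports `len()` and integer indexing, it follows the map-style dataset protocol that data loaders such as `torch.utils.data.DataLoader` expect.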