nemo_automodel.components.datasets.llm.hellaswag


Module Contents

Classes

Name | Description
--- | ---
HellaSwag | A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

API

class nemo_automodel.components.datasets.llm.hellaswag.HellaSwag(
path_or_dataset,
tokenizer,
split = 'train',
num_samples_limit = None,
pad_to_max_length = True
)

A dataset wrapper for the HellaSwag benchmark, tailored for single-turn supervised fine-tuning (SFT).

This class loads the HellaSwag dataset and preprocesses it with a tokenizer and a custom preprocessing pipeline for language model fine-tuning. Each example consists of a context and several multiple-choice endings, where the goal is to choose the most plausible continuation.
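As a rough illustration of the constructor signature above, the following sketch wraps the call in a small helper. The helper name `build_hellaswag`, the dataset identifier `"rowan/hellaswag"`, and the sample limit of 100 are assumptions made for this example, not values taken from this page. The tokenizer is supplied by the caller, and the import is deferred so the helper can be defined even where the package is not installed.

```python
def build_hellaswag(tokenizer, path_or_dataset="rowan/hellaswag", limit=100):
    """Hypothetical helper: construct a HellaSwag dataset for SFT.

    `path_or_dataset` and `limit` defaults are assumptions for illustration.
    """
    # Deferred import: only needed when the helper is actually called.
    from nemo_automodel.components.datasets.llm.hellaswag import HellaSwag

    return HellaSwag(
        path_or_dataset,
        tokenizer,
        split="train",
        num_samples_limit=limit,   # cap the number of examples
        pad_to_max_length=True,    # documented default
    )
```

Calling `build_hellaswag(tokenizer)` with an already-loaded tokenizer should return an object that supports `len(...)` and integer indexing, as described in the method entries below.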

dataset = processor.process(raw_datasets, self)

nemo_automodel.components.datasets.llm.hellaswag.HellaSwag.__getitem__(
index
)

Get a processed example by index.

Parameters:

index
int

Index of the example.

Returns:

A tokenized and preprocessed example.

nemo_automodel.components.datasets.llm.hellaswag.HellaSwag.__len__()

Get the number of examples in the dataset.

Returns:

Length of the processed dataset.
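The `__getitem__` and `__len__` entries above suggest that both methods delegate to the processed `dataset` attribute. The following is a minimal sketch of that pattern under that assumption. The class name `ProcessedWrapper` and the toy data are hypothetical, and this is not the library's actual implementation.

```python
class ProcessedWrapper:
    """Hypothetical stand-in that mimics the indexing and length behavior."""

    def __init__(self, processed):
        # In HellaSwag this would be the output of the preprocessing step.
        self.dataset = processed

    def __getitem__(self, index):
        # Return the processed example stored at `index`.
        return self.dataset[index]

    def __len__(self):
        # Length of the processed dataset.
        return len(self.dataset)


# Toy "processed" examples, invented for this sketch.
wrapper = ProcessedWrapper([{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}])
first = wrapper[0]
count = len(wrapper)
```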

nemo_automodel.components.datasets.llm.hellaswag.HellaSwag.get_context(
examples
)

Extracts the context part of each example.

Parameters:

examples
dict

A dictionary containing example data with a “ctx” key.

Returns:

list[str]: List of context strings.
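To make the expected input shape concrete, here is a standalone sketch of a function with the same contract as `get_context`. It assumes `examples` is a batched dictionary whose "ctx" value is a list of strings. The batch contents are invented, and the real method may differ in detail.

```python
def get_context(examples: dict) -> list[str]:
    """Sketch: return the context strings from a batched examples dict."""
    return list(examples["ctx"])


# Toy batch, invented for illustration.
batch = {
    "ctx": [
        "A man is sitting on a roof. He",
        "A woman fills a bucket with water. She",
    ]
}
contexts = get_context(batch)
```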

nemo_automodel.components.datasets.llm.hellaswag.HellaSwag.get_target(
examples
)

Extracts the correct ending based on the label.

Parameters:

examples
dict

A dictionary with “endings” (list of strings) and “label” (index of correct ending).

Returns:

list[str]: The gold target strings based on the label index.
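The label-based selection described above can be sketched as a standalone function. It assumes `examples` is a batched dictionary in which "endings" holds one list of candidate strings per example and "label" holds one index per example. The `int(...)` conversion is an assumption that covers labels stored as strings. The batch contents are invented for illustration.

```python
def get_target(examples: dict) -> list[str]:
    """Sketch: pick the gold ending for each example using its label index."""
    return [
        endings[int(label)]  # int() tolerates labels stored as "0".."3"
        for endings, label in zip(examples["endings"], examples["label"])
    ]


# Toy batch, invented for illustration.
batch = {
    "endings": [
        ["starts to sing.", "begins removing shingles.", "falls asleep.", "eats lunch."],
        ["washes the dog.", "reads a book.", "paints a wall.", "rides a bike."],
    ],
    "label": ["1", 0],
}
targets = get_target(batch)
```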