nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset

Module Contents

Classes

Name	Description
`ColumnMappedTextInstructionDataset`	Generic instruction-tuning dataset that maps arbitrary column names.
`ColumnTypes`	Supported logical column roles for text instruction datasets.

Functions

Name	Description
`_check_all_values_equal_length`	Check if all values in the sample are of the same length.
`_load_dataset`	Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.
`_str_is_hf_repo_id`	Check if a string is a valid huggingface dataset id.
`make_iterable`	Utility that converts val into an iterator of strings.

Data

logger

API

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    column_mapping: typing.Dict[str, str],
    tokenizer,
    split: typing.Optional[str] = 'train',
    name: typing.Optional[str] = None,
    answer_only_loss_mask: bool = True,
    seq_length: typing.Optional[int] = None,
    padding: typing.Union[str, bool] = 'do_not_pad',
    truncation: typing.Union[str, bool] = 'do_not_truncate',
    limit_dataset_samples: typing.Optional[int] = None,
    use_hf_chat_template: bool = False
)

Bases: Dataset

Generic instruction-tuning dataset that maps arbitrary column names.

The class is intentionally lightweight: it simply loads the raw samples (either from HF or from local JSON/JSONL files) and remaps the columns so that downstream components can rely on a consistent field interface.

Optionally, if answer_only_loss_mask is requested, the dataset will also compute a loss_mask indicating which tokens should contribute to the loss (typically only those belonging to the assistant answer).

dataset

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset.__getitem__(
    idx
)

Returns the item at the given index.

Parameters:

idx

The index of the item to return.

Returns:

A dictionary with the mapped columns.

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset.__iter__() -> typing.Iterator[typing.Dict[str, typing.List[int]]]

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset.__len__() -> int

Returns the length of the dataset.

Returns: int

The length of the dataset.

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset._apply_tokenizer(
    sample: typing.Dict[str, str]
) -> typing.Dict[str, typing.List[int]]

Tokenize a mapped sample and compute auxiliary fields.

If the tokenizer is provided:

If the tokenizer supports a chat template, the dataset will be tokenized in a conversation style.
Otherwise, the dataset will be tokenized in a simple prompt-completion style.

Parameters:

sample

Dict[str, str]

A dictionary with the mapped columns.

Returns: Dict[str, List[int]]

A dictionary with the tokenized columns.

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnTypes

Bases: enum.Enum

Supported logical column roles for text instruction datasets.

Answer

= 'answer'

Context

= 'context'

Question

= 'question'

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._check_all_values_equal_length(
    sample: typing.Dict[str, typing.List[int]]
) -> bool

Check if all values in the sample are of the same length.

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._load_dataset(
    path_or_dataset_id: typing.Union[str, typing.List[str]],
    split: typing.Optional[str] = None,
    streaming: bool = False,
    name: typing.Optional[str] = None
)

Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.

If path_or_dataset_id resembles a HF repo ID (i.e. of the form org/dataset and the path does not exist on the local filesystem), we defer to datasets.load_dataset directly. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let datasets.load_dataset with the “json” script handle the parsing.

Parameters:

path_or_dataset_id

Union[str, List[str]]

Either a HF dataset identifier (org/name) or a path / list of paths to local .json / .jsonl files.

split

Optional[str]Defaults to None

Optional split to load when retrieving a remote dataset. This parameter is ignored for local files as the json script always returns a single split.

streaming

boolDefaults to False

Whether to stream the dataset.

name

Optional[str]Defaults to None

Optional name of the dataset configuration/subset to load

Returns:

datasets.Dataset: The loaded dataset.

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._str_is_hf_repo_id(
    val: str
) -> bool

Check if a string is a valid huggingface dataset id.

Parameters:

val

str

A string to check.

Returns: bool

True if the string is a valid huggingface dataset id, False otherwise.

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.make_iterable(
    val: typing.Union[str, typing.List[str]]
) -> typing.Iterator[str]

Utility that converts val into an iterator of strings.

The helper accepts either a single string or a list of strings and yields its contents. This is handy when we want to treat the two cases uniformly downstream (e.g. when iterating over data_files that can be provided as either a single path or a collection of paths).

Parameters:

val

Union[str, List[str]]

Either a single string or a list/tuple of strings.

Raises:

ValueError: If val is neither a string nor an iterable of strings.

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.logger = logging.getLogger(__name__)