nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset#

Module Contents#

Classes#

ColumnTypes

ColumnMappedTextInstructionDataset

Generic instruction-tuning dataset that maps arbitrary column names.

Functions#

make_iterable

Utility that converts val into an iterator of strings.

_str_is_hf_repo_id

Check if a string is a valid huggingface dataset id.

_load_dataset

Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.

_apply_tokenizer_with_chat_template

Tokenization path when the tokenizer supports a chat template.

_apply_tokenizer_plain

Tokenization path when chat_template is not available.

_has_chat_template

Check if the tokenizer supports a chat template.

_check_all_values_equal_length

Check if all values in the sample are of the same length.

API#

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnTypes(*args, **kwds)[source]#

Bases: enum.Enum

Context#

‘context’

Question#

‘question’

Answer#

‘answer’

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.make_iterable(
val: Union[str, List[str]],
) → Iterator[str][source]#

Utility that converts val into an iterator of strings.

The helper accepts either a single string or a list of strings and yields its contents. This is handy when we want to treat the two cases uniformly downstream (e.g. when iterating over data_files that can be provided as either a single path or a collection of paths).

Parameters:

val – Either a single string or a list/tuple of strings.

Yields:

str – The individual strings contained in val.

Raises:

ValueError – If val is neither a string nor an iterable of strings.
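The behavior described above can be sketched as follows; this is a minimal stand-in written for illustration, not the module's actual implementation:

```python
from typing import Iterator, List, Union

def make_iterable(val: Union[str, List[str]]) -> Iterator[str]:
    # A bare string is yielded once, not iterated character by character.
    if isinstance(val, str):
        yield val
    elif isinstance(val, (list, tuple)) and all(isinstance(v, str) for v in val):
        yield from val
    else:
        raise ValueError(f"Expected a string or an iterable of strings, got {type(val)!r}")
```

Note that because this is a generator, the `ValueError` is only raised once the result is actually iterated.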

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._str_is_hf_repo_id(val: str) → bool[source]#

Check if a string is a valid huggingface dataset id.

Parameters:

val – A string to check.

Returns:

True if the string is a valid huggingface dataset id, False otherwise.
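One plausible implementation of this check, assuming (as `_load_dataset` below suggests) that a repo ID is an `org/name` string that does not exist as a local path; the module's exact heuristic may differ:

```python
import os

def str_is_hf_repo_id(val: str) -> bool:
    # Treat "org/name" as a Hub repo ID, but only when no such local path exists.
    if os.path.exists(val):
        return False
    parts = val.split("/")
    return len(parts) == 2 and all(parts)
```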

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._load_dataset(
path_or_dataset_id: Union[str, List[str]],
split: Optional[str] = None,
streaming: bool = False,
)[source]#

Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.

If path_or_dataset_id resembles a HF repo ID (i.e. of the form org/dataset and the path does not exist on the local filesystem), we defer to datasets.load_dataset directly. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let datasets.load_dataset with the “json” script handle the parsing.

Parameters:
  • path_or_dataset_id – Either a HF dataset identifier (org/name) or a path / list of paths to local .json / .jsonl files.

  • split – Optional split to load when retrieving a remote dataset. This parameter is ignored for local files as the json script always returns a single split.

Returns:

The loaded dataset.

Return type:

datasets.Dataset
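The dispatch described above can be sketched without importing `datasets` itself. `resolve_load_args` is a hypothetical helper that only decides which arguments would be forwarded to `datasets.load_dataset`:

```python
import os
from typing import List, Optional, Union

def resolve_load_args(
    path_or_dataset_id: Union[str, List[str]],
    split: Optional[str] = None,
) -> dict:
    # Hub repo ID: a non-existent local path of the form "org/name".
    if (
        isinstance(path_or_dataset_id, str)
        and not os.path.exists(path_or_dataset_id)
        and len(path_or_dataset_id.split("/")) == 2
    ):
        return {"path": path_or_dataset_id, "split": split}
    # Otherwise: local JSON/JSONL file(s) handled by the "json" loader,
    # which always yields a single split (so `split` is ignored).
    files = [path_or_dataset_id] if isinstance(path_or_dataset_id, str) else list(path_or_dataset_id)
    return {"path": "json", "data_files": files}
```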

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._apply_tokenizer_with_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
context: str,
question: str,
answer: str,
start_of_turn_token: Optional[str] = None,
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]][source]#

Tokenization path when the tokenizer supports a chat template.

Parameters:
  • tokenizer – The tokenizer to use; its chat template formats the conversation.

  • context – The context of the sample.

  • question – The question of the sample.

  • answer – The answer of the sample.

  • start_of_turn_token – The token used to indicate the start of a turn.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with the tokenized columns.
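The core of the answer-only loss mask can be shown in isolation. The real path builds the token IDs via the tokenizer's chat template; `build_masked_sample` below is a hypothetical stand-in that only demonstrates how the mask lines up with the concatenated prompt and answer tokens:

```python
from typing import Dict, List

def build_masked_sample(prompt_ids: List[int], answer_ids: List[int]) -> Dict[str, List[int]]:
    # Prompt tokens get mask 0 (excluded from the loss); answer tokens get 1.
    input_ids = prompt_ids + answer_ids
    loss_mask = [0] * len(prompt_ids) + [1] * len(answer_ids)
    return {"input_ids": input_ids, "labels": list(input_ids), "loss_mask": loss_mask}
```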

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._apply_tokenizer_plain(
tokenizer: transformers.PreTrainedTokenizer,
context: str,
question: str,
answer: str,
answer_only_loss_mask: bool = True,
) → Dict[str, List[int]][source]#

Tokenization path when chat_template is not available.

Parameters:
  • tokenizer – The tokenizer to use.

  • context – The context of the sample.

  • question – The question of the sample.

  • answer – The answer of the sample.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

Returns:

A dictionary with the tokenized columns.
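The plain prompt-completion path can be sketched with a stand-in whitespace tokenizer. Both `toy_tokenize` and `plain_tokenize` are hypothetical; the real function calls a `transformers` tokenizer and returns token IDs rather than strings:

```python
from typing import Dict, List

def toy_tokenize(text: str) -> List[str]:
    # Stand-in for a real subword tokenizer.
    return text.split()

def plain_tokenize(
    context: str, question: str, answer: str, answer_only_loss_mask: bool = True
) -> Dict[str, List]:
    # Prompt-completion style: context and question form the prompt,
    # the answer is the completion.
    prompt = f"{context} {question}".strip()
    prompt_ids = toy_tokenize(prompt)
    answer_ids = toy_tokenize(answer)
    if answer_only_loss_mask:
        loss_mask = [0] * len(prompt_ids) + [1] * len(answer_ids)
    else:
        loss_mask = [1] * (len(prompt_ids) + len(answer_ids))
    return {"input_ids": prompt_ids + answer_ids, "loss_mask": loss_mask}
```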

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._has_chat_template(
tokenizer: transformers.PreTrainedTokenizer,
) → bool[source]#

Check if the tokenizer supports a chat template.

Parameters:

tokenizer – The tokenizer to check.

Returns:

True if the tokenizer supports a chat template, False otherwise.
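A minimal version of this check, assuming (as is common for `transformers` tokenizers) that a usable template lives on the `chat_template` attribute:

```python
def has_chat_template(tokenizer) -> bool:
    # A tokenizer "supports" a chat template when the attribute is set and non-empty.
    return bool(getattr(tokenizer, "chat_template", None))
```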

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._check_all_values_equal_length(
sample: Dict[str, List[int]],
) → bool[source]#

Check if all values in the sample are of the same length.

Parameters:

sample – A dictionary mapping column names to token lists.

Returns:

True if all values have the same length, False otherwise.
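A straightforward sketch of this invariant check (fields such as `input_ids`, `labels` and `loss_mask` must line up token for token):

```python
from typing import Dict, List

def check_all_values_equal_length(sample: Dict[str, List[int]]) -> bool:
    # All per-token columns must have the same length; an empty sample
    # is trivially consistent.
    lengths = {len(v) for v in sample.values()}
    return len(lengths) <= 1
```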

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset(
path_or_dataset_id: Union[str, List[str]],
column_mapping: Dict[str, str],
tokenizer,
*,
split: Optional[str] = None,
streaming: bool = False,
answer_only_loss_mask: bool = True,
start_of_turn_token: Optional[str] = None,
)[source]#

Bases: torch.utils.data.Dataset

Generic instruction-tuning dataset that maps arbitrary column names.

The class is intentionally lightweight: it simply loads the raw samples (either from HF or from local JSON/JSONL files) and remaps the columns so that downstream components can rely on a consistent field interface.

Optionally, if answer_only_loss_mask is requested, the dataset will also compute a loss_mask indicating which tokens should contribute to the loss (typically only those belonging to the assistant answer).

Initialization

Initialize the dataset.

Parameters:
  • path_or_dataset_id – The path or dataset id of the dataset.

  • column_mapping – The mapping of the columns.

  • tokenizer – The tokenizer to use.

  • split – The split of the dataset to load.

  • streaming – Whether to load the dataset in streaming mode.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

  • start_of_turn_token – The token to use to indicate the start of a turn.
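The column remapping itself is a simple key translation. The toy row below is illustrative, and the sketch assumes the mapping goes from canonical field name to source column name (the actual direction is an implementation detail of the class):

```python
# Map the dataset's own column names onto the canonical
# context/question/answer fields the rest of the pipeline expects.
column_mapping = {"context": "passage", "question": "query", "answer": "response"}

raw_row = {
    "passage": "Paris is the capital of France.",
    "query": "What is the capital of France?",
    "response": "Paris.",
}

mapped_row = {dst: raw_row[src] for dst, src in column_mapping.items()}
```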

__len__() → int[source]#

Returns the length of the dataset.

Returns:

The length of the dataset.

Raises:

RuntimeError – If streaming is enabled.

__getitem__(idx)[source]#

Returns the item at the given index.

Parameters:

idx – The index of the item to return.

Returns:

A dictionary with the mapped columns.

Raises:

RuntimeError – If streaming is enabled.

__iter__()[source]#

Iterate over the dataset yielding rows with the requested column mapping.

When streaming=True the underlying dataset is consumed lazily.

If the tokenizer is provided, it will be used to tokenize the dataset.

Returns:

An iterator over the dataset.


_apply_tokenizer(
sample: Dict[str, str],
) → Dict[str, List[int]][source]#

Tokenize a mapped sample and compute auxiliary fields.

If the tokenizer is provided:

  • If the tokenizer supports a chat template, the dataset will be tokenized in a conversation style.

  • Otherwise, the dataset will be tokenized in a simple prompt-completion style.

Parameters:

sample – A dictionary with the mapped columns.

Returns:

A dictionary with the tokenized columns.
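The dispatch reduces to choosing one of the two tokenization paths based on the tokenizer. A hypothetical sketch, with the two paths passed in as callables:

```python
def apply_tokenizer(sample, tokenizer, chat_path, plain_path):
    # Route to the chat-template path when the tokenizer has one,
    # otherwise fall back to the plain prompt-completion path.
    fn = chat_path if bool(getattr(tokenizer, "chat_template", None)) else plain_path
    return fn(sample["context"], sample["question"], sample["answer"])
```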