nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset#

Module Contents#

Classes#

ColumnTypes

ColumnMappedTextInstructionDataset

Generic instruction-tuning dataset that maps arbitrary column names.

Functions#

make_iterable

Utility that converts val into an iterator of strings.

_str_is_hf_repo_id

Check if a string is a valid Hugging Face dataset ID.

_load_dataset

Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.

_check_all_values_equal_length

Check if all values in the sample are of the same length.

Data#

API#

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.logger#

‘getLogger(…)’

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnTypes(*args, **kwds)#

Bases: enum.Enum

Context#

‘context’

Question#

‘question’

Answer#

‘answer’
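The enum values are plain strings, so they can be referenced directly when wiring up a column mapping. A minimal usage sketch based on the documented member values:

```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import ColumnTypes

# The canonical column roles that a column_mapping targets.
print(ColumnTypes.Context.value)   # 'context'
print(ColumnTypes.Question.value)  # 'question'
print(ColumnTypes.Answer.value)    # 'answer'
```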

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.make_iterable(
val: Union[str, List[str]],
) → Iterator[str]#

Utility that converts val into an iterator of strings.

The helper accepts either a single string or a list of strings and yields its contents. This is handy when we want to treat the two cases uniformly downstream (e.g. when iterating over data_files that can be provided as either a single path or a collection of paths).

Parameters:

val – Either a single string or a list/tuple of strings.

Yields:

str – The individual strings contained in val.

Raises:

ValueError – If val is neither a string nor an iterable of strings.
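A short usage sketch following the behaviour documented above; the file names are illustrative:

```python
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import make_iterable

# A single path and a list of paths are handled uniformly.
list(make_iterable("train.jsonl"))                   # ['train.jsonl']
list(make_iterable(["train.jsonl", "valid.jsonl"]))  # ['train.jsonl', 'valid.jsonl']

# Any other type (e.g. an int) raises ValueError.
```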

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._str_is_hf_repo_id(val: str) → bool#

Check if a string is a valid Hugging Face dataset ID.

Parameters:

val – A string to check.

Returns:

True if the string is a valid Hugging Face dataset ID, False otherwise.
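A minimal sketch of what such a check might look like, assuming the heuristic described for _load_dataset below (an org/name string that does not exist on the local filesystem); the actual implementation may differ:

```python
import os

def _str_is_hf_repo_id_sketch(val: str) -> bool:
    # Hypothetical approximation: two non-empty path components and not an existing local path.
    if os.path.exists(val):
        return False
    parts = val.split("/")
    return len(parts) == 2 and all(parts)

_str_is_hf_repo_id_sketch("rajpurkar/squad")   # True (assuming no such local path exists)
_str_is_hf_repo_id_sketch("data/train.jsonl")  # False if the file exists locally
```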

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._load_dataset(
path_or_dataset_id: Union[str, List[str]],
split: Optional[str] = None,
streaming: bool = False,
name: Optional[str] = None,
)#

Load a dataset either from the Hugging Face Hub or from local JSON/JSONL files.

If path_or_dataset_id resembles a HF repo ID (i.e. of the form org/dataset and the path does not exist on the local filesystem), we defer to datasets.load_dataset directly. Otherwise, we assume the argument points to one or more local JSON/JSONL files and let datasets.load_dataset with the “json” script handle the parsing.

Parameters:
  • path_or_dataset_id – Either a HF dataset identifier (org/name) or a path / list of paths to local .json / .jsonl files.

  • split – Optional split to load when retrieving a remote dataset. This parameter is ignored for local files as the json script always returns a single split.

  • streaming – Whether to stream the dataset.

  • name – Optional name of the dataset configuration/subset to load.

Returns:

The loaded dataset.

Return type:

datasets.Dataset
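Two hypothetical calls illustrating both code paths; the dataset ID and file paths are placeholders:

```python
# Remote: a Hugging Face Hub dataset id is forwarded to datasets.load_dataset.
ds_remote = _load_dataset("rajpurkar/squad", split="train")

# Local: one or more JSON/JSONL files are parsed via the "json" loading script;
# `split` is ignored in this case.
ds_local = _load_dataset(["data/train.jsonl", "data/extra.jsonl"])
```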

nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset._check_all_values_equal_length(
sample: Dict[str, List[int]],
) → bool#

Check if all values in the sample are of the same length.
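A minimal sketch of the check, assuming it simply compares the lengths of all token lists in the sample:

```python
from typing import Dict, List

def _check_all_values_equal_length_sketch(sample: Dict[str, List[int]]) -> bool:
    # True when every list in the sample has the same length (trivially True for <= 1 key).
    lengths = {len(v) for v in sample.values()}
    return len(lengths) <= 1

_check_all_values_equal_length_sketch({"input_ids": [1, 2, 3], "loss_mask": [0, 1, 1]})  # True
_check_all_values_equal_length_sketch({"input_ids": [1, 2, 3], "loss_mask": [0, 1]})     # False
```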

class nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset.ColumnMappedTextInstructionDataset(
path_or_dataset_id: Union[str, List[str]],
column_mapping: Dict[str, str],
tokenizer,
*,
split: Optional[str] = 'train',
name: Optional[str] = None,
answer_only_loss_mask: bool = True,
seq_length: Optional[int] = None,
padding: Union[str, bool] = 'do_not_pad',
truncation: Union[str, bool] = 'do_not_truncate',
limit_dataset_samples: Optional[int] = None,
use_hf_chat_template: bool = False,
)#

Bases: torch.utils.data.Dataset

Generic instruction-tuning dataset that maps arbitrary column names.

The class is intentionally lightweight: it simply loads the raw samples (either from HF or from local JSON/JSONL files) and remaps the columns so that downstream components can rely on a consistent field interface.

Optionally, if answer_only_loss_mask is requested, the dataset will also compute a loss_mask indicating which tokens should contribute to the loss (typically only those belonging to the assistant answer).

Initialization

Initialize the dataset.

Parameters:
  • path_or_dataset_id – The path or dataset id of the dataset.

  • column_mapping – The mapping of the columns.

  • tokenizer – The tokenizer to use.

  • split – The split of the dataset to load.

  • name – The name of the dataset configuration/subset to load.

  • answer_only_loss_mask – Whether to compute the loss mask only on the answer tokens.

  • seq_length – The sequence length to use for padding.

  • limit_dataset_samples – The number of samples to load from the dataset.
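A hypothetical construction showing how these parameters fit together; the dataset ID, tokenizer name, and the direction of column_mapping (canonical role → source column) are assumptions for illustration:

```python
from transformers import AutoTokenizer
from nemo_automodel.components.datasets.llm.column_mapped_text_instruction_dataset import (
    ColumnMappedTextInstructionDataset,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # any HF tokenizer

dataset = ColumnMappedTextInstructionDataset(
    path_or_dataset_id="rajpurkar/squad",        # or a list of local .json/.jsonl paths
    column_mapping={
        "context": "context",                    # illustrative mapping onto the canonical
        "question": "question",                  # context/question/answer roles
        "answer": "answers",
    },
    tokenizer=tokenizer,
    split="train",
    answer_only_loss_mask=True,
    seq_length=1024,
    limit_dataset_samples=1000,
)
```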

__len__() → int#

Returns the length of the dataset.

Returns:

The length of the dataset.

Raises:

RuntimeError – If streaming is enabled.

__getitem__(idx)#

Returns the item at the given index.

Parameters:

idx – The index of the item to return.

Returns:

A dictionary with the mapped columns.

Raises:

RuntimeError – If streaming is enabled.
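Continuing the construction example above, a non-streaming dataset is indexed like any map-style torch dataset; the exact keys of the returned dictionary depend on the tokenizer path taken in _apply_tokenizer below:

```python
print(len(dataset))   # number of loaded samples (raises RuntimeError when streaming)

sample = dataset[0]   # dict with the mapped, tokenized columns
print(sample.keys())  # e.g. token ids plus a loss_mask when answer_only_loss_mask=True
```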

_apply_tokenizer(
sample: Dict[str, str],
) → Dict[str, List[int]]#

Tokenize a mapped sample and compute auxiliary fields.

If the tokenizer is provided:

  • If the tokenizer supports a chat template, the dataset will be tokenized in a conversation style.

  • Otherwise, the dataset will be tokenized in a simple prompt-completion style.

Parameters:

sample – A dictionary with the mapped columns.

Returns:

A dictionary with the tokenized columns.
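A rough sketch of the branching described above, assuming a Hugging Face tokenizer; the message layout and returned keys are illustrative, not the actual implementation:

```python
def _apply_tokenizer_sketch(tokenizer, sample):
    # Hypothetical prompt built from the mapped columns.
    prompt = f"{sample.get('context', '')}\n{sample['question']}".strip()
    if getattr(tokenizer, "chat_template", None):
        # Conversation style: render the mapped columns as user/assistant turns.
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": sample["answer"]},
        ]
        input_ids = tokenizer.apply_chat_template(messages)
    else:
        # Simple prompt-completion style: concatenate prompt and answer token ids.
        prompt_ids = tokenizer(prompt)["input_ids"]
        answer_ids = tokenizer(sample["answer"], add_special_tokens=False)["input_ids"]
        input_ids = prompt_ids + answer_ids
    return {"input_ids": input_ids}
```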